Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency (2020-12-17)

At the heart of many robotics problems is the challenge of learning correspondences across domains. For instance, imitation learning requires obtaining correspondence between humans and robots; sim-to-real requires correspondence between physics simulators and the real world; transfer learning requires correspondences between different robotics environments. This paper aims to learn correspondence across domains differing in representation (vision vs. internal state), physics parameters (mass and friction), and morphology (number of limbs). Importantly, correspondences are learned using unpaired and randomly collected data from the two domains. We propose dynamics cycles that align dynamic robot behavior across two domains using a cycle-consistency constraint. Once this correspondence is found, we can directly transfer the policy trained on one domain to the other, without needing any additional fine-tuning on the second domain. We perform experiments across a variety of problem domains, both in simulation and on a real robot. Our framework is able to align uncalibrated monocular video of a real robot arm to dynamic state-action trajectories of a simulated arm without paired data. Video demonstrations of our results are available at: https://sjtuzq.github.io/cycle_dynamics.html.

A Hypergradient Approach to Robust Regression without Correspondence (2020-11-30)

We consider a regression problem where the correspondence between input and output data is not available. Such shuffled data is commonly observed in many real-world problems. Taking flow cytometry as an example, the measuring instruments are unable to preserve the correspondence between the samples and the measurements. Due to the combinatorial nature of the problem, most existing methods are applicable only when the sample size is small, and are limited to linear regression models. To overcome such bottlenecks, we propose a new computational framework, ROBOT, for the shuffled regression problem, which is applicable to large data and complex models. Specifically, we propose to formulate the regression without correspondence as a continuous optimization problem. Then, by exploiting the interaction between the regression model and the data correspondence, we propose to develop a hypergradient approach based on differentiable programming techniques. Such a hypergradient approach essentially views the data correspondence as an operator of the regression, and therefore allows us to find a better descent direction for the model parameter by differentiating through the data correspondence. ROBOT is quite general, and can be further extended to the inexact correspondence setting, where the input and output data are not necessarily exactly aligned. Thorough numerical experiments show that ROBOT achieves better performance than existing methods in both linear and nonlinear regression tasks, including real-world applications such as flow cytometry and multi-object tracking.
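To make the shuffled-regression setting in the abstract above concrete, here is a minimal sketch of a much simpler baseline than ROBOT: alternating minimization for a scalar model y = w * x. It is not the hypergradient method; the function name `fit_shuffled_1d` and the toy data are assumptions made for illustration only. The sketch relies on the fact that, for scalar outputs and squared error, the best one-to-one matching pairs sorted predictions with sorted observations.

```python
def fit_shuffled_1d(xs, ys, w0=1.0, iters=20):
    """Alternate between matching and refitting for the model y = w * x."""
    w = w0
    ys_sorted = sorted(ys)
    for _ in range(iters):
        # Correspondence step: for scalar outputs and squared error, the best
        # one-to-one matching pairs the k-th smallest prediction with the
        # k-th smallest observation.
        order = sorted(range(len(xs)), key=lambda i: w * xs[i])
        matched = [0.0] * len(xs)
        for rank, i in enumerate(order):
            matched[i] = ys_sorted[rank]
        # Regression step: ordinary least squares through the origin.
        w = sum(x * y for x, y in zip(xs, matched)) / sum(x * x for x in xs)
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys_shuffled = [6.0, 2.0, 8.0, 4.0]  # the values 2 * x in scrambled order
w_hat = fit_shuffled_1d(xs, ys_shuffled)
```

Alternating schemes of this kind depend on the starting point (a negative `w0` would reverse the matching here), which is one reason a continuous formulation that differentiates through the correspondence is attractive.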

Matching neural paths: transfer from recognition to correspondence search (2017-11-05)

Many machine learning tasks require finding per-part correspondences between objects. In this work we focus on low-level correspondences - a highly ambiguous matching problem. We propose to use a hierarchical semantic representation of the objects, coming from a convolutional neural network, to solve this ambiguity. Training it for low-level correspondence prediction directly might not be an option in some domains where the ground-truth correspondences are hard to obtain. We show how transfer from recognition can be used to avoid such training. Our idea is to mark parts as "matching" if their features are close to each other at all levels of the convolutional feature hierarchy (neural paths). Although the overall number of such paths is exponential in the number of layers, we propose a polynomial algorithm for aggregating all of them in a single backward pass. The empirical validation is done on the task of stereo correspondence and demonstrates that we achieve competitive results among the methods which do not use labeled target domain data.

Neural Non-Rigid Tracking (2020-06-23)

We introduce a novel, end-to-end learnable, differentiable non-rigid tracker that enables state-of-the-art non-rigid reconstruction. Given two input RGB-D frames of a non-rigidly moving object, we employ a convolutional neural network to predict dense correspondences. These correspondences are used as constraints in an as-rigid-as-possible (ARAP) optimization problem. By enabling gradient back-propagation through the non-rigid optimization solver, we are able to learn correspondences in an end-to-end manner such that they are optimal for the task of non-rigid tracking. Furthermore, this formulation allows for learning correspondence weights in a self-supervised manner. Thus, outliers and wrong correspondences are down-weighted to enable robust tracking. Compared to state-of-the-art approaches, our algorithm shows improved reconstruction performance, while simultaneously achieving 85 times faster correspondence prediction than comparable deep-learning based methods.

Learning Two-View Correspondences and Geometry Using Order-Aware Network (2019-08-14)

Establishing correspondences between two images requires both local and global spatial context. Given putative correspondences of feature points in two views, in this paper, we propose Order-Aware Network, which infers the probabilities of correspondences being inliers and regresses the relative pose encoded by the essential matrix. Specifically, this proposed network is built hierarchically and comprises three novel operations. First, to capture the local context of sparse correspondences, the network clusters unordered input correspondences by learning a soft assignment matrix. These clusters are in a canonical order and invariant to input permutations. Next, the clusters are spatially correlated to form the global context of correspondences. After that, the context-encoded clusters are recovered back to the original size through a proposed upsampling operator. We experiment extensively on both outdoor and indoor datasets. The accuracy of both the two-view geometry and the correspondences is significantly improved over the state of the art. Code will be available at https://github.com/zjhthu/OANet.git.
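The claim that soft-assignment clusters are invariant to input permutations can be checked on a toy example. The sketch below is not the Order-Aware Network itself: `soft_assign_pool` and `toy_scores` are hypothetical names, the per-point scores come from a fixed function rather than a learned layer, and normalizing each cluster's weights over the points is an assumption. Because every point's scores depend only on that point, shuffling the input shuffles the weights in the same way and leaves each cluster unchanged.

```python
import math


def soft_assign_pool(points, score_fn, num_clusters):
    """Pool N unordered points into num_clusters weighted averages."""
    scores = [score_fn(p) for p in points]  # N rows, one score per cluster
    dim = len(points[0])
    pooled = []
    for m in range(num_clusters):
        column = [row[m] for row in scores]
        top = max(column)  # subtract the maximum for numerical stability
        weights = [math.exp(c - top) for c in column]
        total = sum(weights)
        # Cluster m is the softmax-weighted average of all points.
        pooled.append([
            sum(w / total * p[d] for w, p in zip(weights, points))
            for d in range(dim)
        ])
    return pooled


def toy_scores(p):
    # Stand-in for a learned scoring layer (assumption for illustration).
    return [p[0], -p[0]]


points = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
clusters = soft_assign_pool(points, toy_scores, 2)
clusters_reversed = soft_assign_pool(points[::-1], toy_scores, 2)
```

Up to floating-point rounding, `clusters` and `clusters_reversed` agree, and cluster 0 always refers to the same scoring channel regardless of input order.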

Space-Time Correspondence as a Contrastive Random Walk (2020-06-25)

This paper proposes a simple self-supervised approach for learning representations for visual correspondence from raw video. We cast correspondence as link prediction in a space-time graph constructed from a video. In this graph, the nodes are patches sampled from each frame, and nodes adjacent in time can share a directed edge. We learn a node embedding in which pairwise similarity defines transition probabilities of a random walk. Prediction of long-range correspondence is efficiently computed as a walk along this graph. The embedding learns to guide the walk by placing high probability along paths of correspondence. Targets are formed without supervision, by cycle-consistency: we train the embedding to maximize the likelihood of returning to the initial node when walking along a graph constructed from a "palindrome" of frames. We demonstrate that the approach allows for learning representations from large unlabeled video. Despite its simplicity, the method outperforms the self-supervised state-of-the-art on a variety of label propagation tasks involving objects, semantic parts, and pose. Moreover, we show that self-supervised adaptation at test-time and edge dropout improve transfer for object-level correspondence.
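The palindrome objective described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: embeddings are plain lists of unit vectors, the temperature value 0.1 is arbitrary, and the names `transition` and `palindrome_walk_loss` are hypothetical. Transition probabilities are a softmax over pairwise similarities, the walk is the product of transition matrices along the frame palindrome, and the loss is the negative log-probability of returning to the start node.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def transition(src, dst, temperature=0.1):
    """Row-stochastic matrix: softmax over similarities from src to dst nodes."""
    rows = []
    for u in src:
        logits = [dot(u, v) / temperature for v in dst]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def palindrome_walk_loss(frames, temperature=0.1):
    """Negative log-likelihood of returning to the start node (needs >= 2 frames)."""
    sequence = frames + frames[-2::-1]  # e.g. [f0, f1, f2, f1, f0]
    walk = transition(sequence[0], sequence[1], temperature)
    for src, dst in zip(sequence[1:], sequence[2:]):
        walk = matmul(walk, transition(src, dst, temperature))
    n = len(walk)
    return -sum(math.log(walk[i][i]) for i in range(n)) / n


s = math.sqrt(0.5)
distinct = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
ambiguous = [[[1.0, 0.0], [0.0, 1.0]], [[s, s], [s, s]]]
loss_distinct = palindrome_walk_loss(distinct)
loss_ambiguous = palindrome_walk_loss(ambiguous)
```

When the middle frame's nodes are indistinguishable, the walk returns to its start only by chance and the loss equals log 2 for two nodes; well-separated embeddings drive it toward zero.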

Learning Deep Features for Shape Correspondence with Domain Invariance (2021-02-20)

Correspondence-based shape models are key to various medical imaging applications that rely on a statistical analysis of anatomies. Such shape models are expected to represent consistent anatomical features across the population for population-specific shape statistics. Early approaches for correspondence placement rely on nearest neighbor search for simpler anatomies. Coordinate transformations for shape correspondence hold promise to address the increasing anatomical complexities. Nonetheless, due to the inherent shape-level geometric complexity and population-level shape variation, the coordinate-wise correspondence often does not translate to the anatomical correspondence. An alternative, group-wise approach for correspondence placement explicitly models the trade-off between geometric description and the population's statistical compactness. However, these models achieve limited success in resolving nonlinear shape correspondence. Recent works have addressed this limitation by adopting an application-specific notion of correspondence through lifting positional data to a higher dimensional feature space. However, they heavily rely on manual expertise to create domain-specific features and consistent landmarks. This paper proposes an automated feature learning approach, using deep convolutional neural networks to extract correspondence-friendly features from shape ensembles. Further, an unsupervised domain adaptation scheme is introduced to augment the pretrained geometric features with new anatomies. Results on anatomical datasets of human scapula, femur, and pelvis bones demonstrate that features learned in a supervised fashion show improved performance for correspondence estimation compared to the manual features. Further, unsupervised learning is demonstrated to learn complex anatomy features using the supervised domain adaptation from features learned on simpler anatomy.

Unseeded low-rank graph matching by transform-based unsupervised point registration (2018-07-12)

The problem of learning a correspondence relationship between nodes of two networks has drawn much attention from the computer science community and, recently, from statisticians. The unseeded version of this problem, in which we do not know any part of the true correspondence, is a long-standing challenge. For low-rank networks, the problem can be translated into an unsupervised point registration problem, in which two point sets generated from the same distribution are matchable by an unknown orthonormal transformation. Conventional methods generally lack consistency guarantees and are usually computationally costly. In this paper, we propose a novel approach to this problem. Instead of simultaneously estimating the unknown correspondence and orthonormal transformation to match up the two point sets, we match their distributions by minimizing a purpose-built loss function that captures the discrepancy between their Laplace transforms, thus avoiding the optimization over all possible correspondences. This dramatically reduces the dimension of the optimization problem from $\Omega(n^2)$ parameters to $O(d^2)$ parameters, where $d$ is the fixed rank, and enables convenient theoretical analysis. In this paper, we provide arguably the first consistency guarantee and explicit error rate for general low-rank models. Our method provides control over the computational complexity ranging from $\omega(n)$ (any growth rate faster than $n$) to $O(n^2)$ while retaining consistency. We demonstrate the effectiveness of our method through several numerical examples.

Pedestrian Tracking by Probabilistic Data Association and Correspondence Embeddings (2019-07-16)

This paper studies the interplay between kinematics (position and velocity) and appearance cues for establishing correspondences in multi-target pedestrian tracking. We investigate tracking-by-detection approaches based on a deep learning detector, joint integrated probabilistic data association (JIPDA), and appearance-based tracking of deep correspondence embeddings. We first addressed the fixed-camera setup by fine-tuning a convolutional detector for accurate pedestrian detection and combining it with kinematic-only JIPDA. The resulting submission ranked first on the 3DMOT2015 benchmark. However, in sequences with a moving camera and unknown ego-motion, we achieved the best results by replacing kinematic cues with global nearest neighbor tracking of deep correspondence embeddings. We trained the embeddings by fine-tuning features from the second block of ResNet-18 using angular loss extended by a margin term. We note that integrating deep correspondence embeddings directly in JIPDA did not bring significant improvement. It appears that the geometry of deep correspondence embeddings for soft data association needs further investigation in order to obtain the best of both worlds.
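Global nearest neighbor association, mentioned above, picks the one-to-one assignment of tracks to detections with the lowest total cost rather than matching each track greedily. The sketch below is a toy version under stated assumptions: the cost is cosine distance between embeddings, the search is brute force over permutations (practical only for a handful of targets; real trackers use an assignment solver), there are at least as many detections as tracks, and all names and embedding values are made up for illustration.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - sum(a * b for a, b in zip(u, v)) / norm


def global_nearest_neighbor(track_embs, det_embs):
    """Return (assignment, cost); assignment[i] is the detection for track i.

    Assumes len(track_embs) <= len(det_embs).
    """
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(det_embs)), len(track_embs)):
        cost = sum(cosine_distance(track_embs[i], det_embs[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
assignment, total_cost = global_nearest_neighbor(tracks, detections)
```

Here track 0 is matched to detection 1 and track 1 to detection 0, while the third detection stays unassigned.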

Metric-Based Imitation Learning Between Two Dissimilar Anthropomorphic Robotic Arms (2020-02-25)

The development of autonomous robotic systems that can learn from human demonstrations to imitate a desired behavior - rather than being manually programmed - has huge technological potential. One major challenge in imitation learning is the correspondence problem: how to establish corresponding states and actions between expert and learner, when the embodiments of the agents are different (morphology, dynamics, degrees of freedom, etc.). Many existing approaches in imitation learning circumvent the correspondence problem, for example, kinesthetic teaching or teleoperation, which are performed on the robot. In this work we explicitly address the correspondence problem by introducing a distance measure between dissimilar embodiments. This measure is then used as a loss function for static pose imitation and as a feedback signal within a model-free deep reinforcement learning framework for dynamic movement imitation between two anthropomorphic robotic arms in simulation. We find that the measure is well suited for describing the similarity between embodiments and for learning imitation policies by distance minimization.

Self-supervised Learning for Video Correspondence Flow (2019-07-27)

The objective of this paper is self-supervised learning of feature embeddings that are suitable for matching correspondences along videos, which we term correspondence flow. By leveraging the natural spatial-temporal coherence in videos, we propose to train a "pointer" that reconstructs a target frame by copying pixels from a reference frame. We make the following contributions: First, we introduce a simple information bottleneck that forces the model to learn robust features for correspondence matching and prevents it from learning trivial solutions, e.g., matching based on low-level colour information. Second, to tackle the challenges from tracker drifting, due to complex object deformations, illumination changes and occlusions, we propose to train a recursive model over long temporal windows with scheduled sampling and cycle consistency. Third, we achieve state-of-the-art performance on DAVIS 2017 video segmentation and JHMDB keypoint tracking tasks, outperforming all previous self-supervised learning approaches by a significant margin. Fourth, in order to shed light on the potential of self-supervised learning on the task of video correspondence flow, we probe the upper bound by training on additional data, i.e., more diverse videos, further demonstrating significant improvements on video segmentation.

Learning Inter-Modal Correspondence and Phenotypes from Multi-Modal Electronic Health Records (2020-11-12)

Non-negative tensor factorization has been shown to be a practical solution for automatically discovering phenotypes from electronic health records (EHR) with minimal human supervision. Such methods generally require an input tensor describing the inter-modal interactions to be pre-established; however, the correspondence between different modalities (e.g., correspondence between medications and diagnoses) can often be missing in practice. Although heuristic methods can be applied to estimate them, they inevitably introduce errors and lead to sub-optimal phenotype quality. This is particularly important for patients with complex health conditions (e.g., in critical care) as multiple diagnoses and medications are simultaneously present in the records. To alleviate this problem and discover phenotypes from EHR with unobserved inter-modal correspondence, we propose the collective hidden interaction tensor factorization (cHITF) to infer the correspondence between multiple modalities jointly with the phenotype discovery. We assume that the observed matrix for each modality is a marginalization of the unobserved inter-modal correspondence, which is reconstructed by maximizing the likelihood of the observed matrices. Extensive experiments conducted on the real-world MIMIC-III dataset demonstrate that cHITF effectively infers clinically meaningful inter-modal correspondence, discovers phenotypes that are more clinically relevant and diverse, and achieves better predictive performance compared with a number of state-of-the-art computational phenotyping models.

Deep Graph Matching Consensus (2020-01-27)

This work presents a two-stage neural architecture for learning and refining structural correspondences between graphs. First, we use localized node embeddings computed by a graph neural network to obtain an initial ranking of soft correspondences between nodes. Secondly, we employ synchronous message passing networks to iteratively re-rank the soft correspondences to reach a matching consensus in local neighborhoods between graphs. We show, theoretically and empirically, that our message passing scheme computes a well-founded measure of consensus for corresponding neighborhoods, which is then used to guide the iterative re-ranking process. Our purely local and sparsity-aware architecture scales well to large, real-world inputs while still being able to recover global correspondences consistently. We demonstrate the practical effectiveness of our method on real-world tasks from the fields of computer vision and entity alignment between knowledge graphs, on which we improve upon the current state-of-the-art. Our source code is available under https://github.com/rusty1s/deep-graph-matching-consensus.

Learning Correspondence from the Cycle-Consistency of Time (2019-04-02)

We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as a free supervisory signal for learning visual representations from scratch. At training time, our model learns a feature map representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation -- without finetuning -- across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods.

Sample-to-Sample Correspondence for Unsupervised Domain Adaptation (2018-12-04)

The assumption that training and testing samples are generated from the same distribution does not always hold for real-world machine-learning applications. The procedure of tackling this discrepancy between the training (source) and testing (target) domains is known as domain adaptation. We propose an unsupervised version of domain adaptation that considers the presence of only unlabelled data in the target domain. Our approach centers on finding correspondences between samples of each domain. The correspondences are obtained by treating the source and target samples as graphs and using a convex criterion to match them. The criteria used are first-order and second-order similarities between the graphs as well as a class-based regularization. We have also developed a computationally efficient routine for the convex optimization, thus allowing the proposed method to be used widely. To verify the effectiveness of the proposed method, computer simulations were conducted on synthetic, image classification and sentiment classification datasets. Results validated that the proposed local sample-to-sample matching method out-performs traditional moment-matching methods and is competitive with respect to current local domain-adaptation methods.

Look, Listen and Learn (2017-08-01)

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.

Deep Fundamental Matrix Estimation without Correspondences (2018-10-02)

Estimating fundamental matrices is a classic problem in computer vision. Traditional methods rely heavily on the correctness of estimated key-point correspondences, which can be noisy and unreliable. As a result, it is difficult for these methods to handle image pairs with large occlusion or significantly different camera poses. In this paper, we propose novel neural network architectures to estimate fundamental matrices in an end-to-end manner without relying on point correspondences. New modules and layers are introduced in order to preserve mathematical properties of the fundamental matrix as a homogeneous rank-2 matrix with seven degrees of freedom. We analyze the performance of the proposed models using various metrics on the KITTI dataset, and show that they achieve competitive performance with traditional methods without the need for extracting correspondences.
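The properties named above (homogeneous, rank-deficient, seven degrees of freedom) can be illustrated on a toy two-view setup. The count follows from 9 matrix entries minus 1 for overall scale and minus 1 for the constraint det F = 0. The sketch below does not reproduce the paper's network layers; it assumes identity intrinsics and identity rotation, so the matrix reduces to the cross-product matrix of the translation, and the point coordinates are made up. It checks that the matrix is singular and that the epipolar constraint x2^T F x1 = 0 holds for any rescaling of F.

```python
def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def epipolar_residual(F, x1, x2):
    """Value of x2^T F x1; zero for a true correspondence."""
    Fx1 = [sum(F[i][k] * x1[k] for k in range(3)) for i in range(3)]
    return sum(x2[i] * Fx1[i] for i in range(3))


t = [1.0, 0.0, 0.0]                      # camera 2 is shifted along x
F = skew(t)                              # identity intrinsics and rotation assumed
X1 = [0.5, 0.2, 4.0]                     # 3D point in camera-1 coordinates
X2 = [X1[i] + t[i] for i in range(3)]    # same point in camera-2 coordinates
x1 = [c / X1[2] for c in X1]             # homogeneous image points
x2 = [c / X2[2] for c in X2]
F_scaled = [[3.0 * v for v in row] for row in F]
```

Both `F` and `F_scaled` give a zero residual for the pair (`x1`, `x2`), which is what "homogeneous" means here: the matrix is defined only up to scale.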

Learning Affective Correspondence between Music and Image (2019-04-16)

We introduce the problem of learning affective correspondence between audio (music) and visual data (images). For this task, a music clip and an image are considered similar (having true correspondence) if they have similar emotion content. In order to estimate this crossmodal, emotion-centric similarity, we propose a deep neural network architecture that learns to project the data from the two modalities to a common representation space, and performs a binary classification task of predicting the affective correspondence (true or false). To facilitate the current study, we construct a large scale database containing more than $3,500$ music clips and $85,000$ images with three emotion classes (positive, neutral, negative). The proposed approach achieves $61.67\%$ accuracy for the affective correspondence prediction task on this database, outperforming two relevant and competitive baselines. We also demonstrate that our network learns modality-specific representations of emotion (without explicitly being trained with emotion labels), which are useful for emotion recognition in individual modalities.

Unsupervised Correlation Analysis (2018-04-01)

Linking between two data sources is a basic building block in numerous computer vision problems. In this paper, we set out to answer a fundamental cognitive question: are prior correspondences necessary for linking between different domains? One of the most popular methods for linking between domains is Canonical Correlation Analysis (CCA). All current CCA algorithms require correspondences between the views. We introduce a new method, Unsupervised Correlation Analysis (UCA), which requires no prior correspondences between the two domains. The correlation maximization term in CCA is replaced by a combination of a reconstruction term (similar to autoencoders), full cycle loss, orthogonality and multiple domain confusion terms. Due to the lack of supervision, the optimization leads to multiple alternative solutions with similar scores, and we therefore introduce a consensus-based mechanism that is often able to recover the desired solution. Remarkably, this suffices in order to link remote domains such as text and images. We also present results on well accepted CCA benchmarks, showing that performance far exceeds other unsupervised baselines, and approaches supervised performance in some cases.

Coupled Clustering: a Method for Detecting Structural Correspondence (2001-07-23)

This paper proposes a new paradigm and computational framework for identification of correspondences between sub-structures of distinct composite systems. For this, we define and investigate a variant of traditional data clustering, termed coupled clustering, which simultaneously identifies corresponding clusters within two data sets. The presented method is demonstrated and evaluated for detecting topical correspondences in textual corpora.