BWCP: Probabilistic Learning-to-Prune Channels for ConvNets via Batch Whitening2021-05-13 ${\displaystyle \cong }$ |

This work presents a probabilistic channel pruning method to accelerate Convolutional Neural Networks (CNNs). Previous pruning methods often zero out unimportant channels in training in a deterministic manner, which reduces CNN's learning capacity and results in suboptimal performance. To address this problem, we develop a probability-based pruning algorithm, called batch whitening channel pruning (BWCP), which can stochastically discard unimportant channels by modeling the probability of a channel being activated. BWCP has several merits. (1) It simultaneously trains and prunes CNNs from scratch in a probabilistic way, exploring larger network space than deterministic methods. (2) BWCP is empowered by the proposed batch whitening tool, which is able to empirically and theoretically increase the activation probability of useful channels while keeping unimportant channels unchanged without adding any extra parameters and computational cost in inference. (3) Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet with various network architectures show that BWCP outperforms its counterparts by achieving better accuracy given limited computational budgets. For example, ResNet50 pruned by BWCP has only 0.70\% Top-1 accuracy drop on ImageNet, while reducing 43.1\% FLOPs of the plain ResNet50. |

Exploiting Channel Similarity for Accelerating Deep Convolutional Neural Networks2019-08-06 ${\displaystyle \cong }$ |

To address the limitations of existing magnitude-based pruning algorithms in cases where model weights or activations are of large and similar magnitude, we propose a novel perspective to discover parameter redundancy among channels and accelerate deep CNNs via channel pruning. Precisely, we argue that channels revealing similar feature information have functional overlap and that most channels within each such similarity group can be removed without compromising model's representational power. After deriving an effective metric for evaluating channel similarity through probabilistic modeling, we introduce a pruning algorithm via hierarchical clustering of channels. In particular, the proposed algorithm does not rely on sparsity training techniques or complex data-driven optimization and can be directly applied to pre-trained models. Extensive experiments on benchmark datasets strongly demonstrate the superior acceleration performance of our approach over prior arts. On ImageNet, our pruned ResNet-50 with 30% FLOPs reduced outperforms the baseline model. |

DMCP: Differentiable Markov Channel Pruning for Neural Networks2020-05-07 ${\displaystyle \cong }$ |

Recent works imply that the channel pruning can be regarded as searching optimal sub-structure from unpruned networks. However, existing works based on this observation require training and evaluating a large number of structures, which limits their application. In this paper, we propose a novel differentiable method for channel pruning, named Differentiable Markov Channel Pruning (DMCP), to efficiently search the optimal sub-structure. Our method is differentiable and can be directly optimized by gradient descent with respect to standard task loss and budget regularization (e.g. FLOPs constraint). In DMCP, we model the channel pruning as a Markov process, in which each state represents for retaining the corresponding channel during pruning, and transitions between states denote the pruning process. In the end, our method is able to implicitly select the proper number of channels in each layer by the Markov process with optimized transitions. To validate the effectiveness of our method, we perform extensive experiments on Imagenet with ResNet and MobilenetV2. Results show our method can achieve consistent improvement than state-of-the-art pruning methods in various FLOPs settings. The code is available at https://github.com/zx55/dmcp |

Gradual Channel Pruning while Training using Feature Relevance Scores for Convolutional Neural Networks2020-04-29 ${\displaystyle \cong }$ |

The enormous inference cost of deep neural networks can be scaled down by network compression. Pruning is one of the predominant approaches used for deep network compression. However, existing pruning techniques have one or more of the following limitations: 1) Additional energy cost on top of the compute heavy training stage due to pruning and fine-tuning stages, 2) Layer-wise pruning based on the statistics of a particular, ignoring the effect of error propagation in the network, 3) Lack of an efficient estimate for determining the important channels globally, 4) Unstructured pruning requires specialized hardware for effective use. To address all the above issues, we present a simple-yet-effective gradual channel pruning while training methodology using a novel data-driven metric referred to as feature relevance score. The proposed technique gets rid of the additional retraining cycles by pruning the least important channels in a structured fashion at fixed intervals during the actual training phase. Feature relevance scores help in efficiently evaluating the contribution of each channel towards the discriminative power of the network. We demonstrate the effectiveness of the proposed methodology on architectures such as VGG and ResNet using datasets such as CIFAR-10, CIFAR-100 and ImageNet, and successfully achieve significant model compression while trading off less than $1\%$ accuracy. Notably on CIFAR-10 dataset trained on ResNet-110, our approach achieves $2.4\times$ compression and a $56\%$ reduction in FLOPs with an accuracy drop of $0.01\%$ compared to the unpruned network. |

Operation-Aware Soft Channel Pruning using Differentiable Masks2020-07-21 ${\displaystyle \cong }$ |

We propose a simple but effective data-driven channel pruning algorithm, which compresses deep neural networks in a differentiable way by exploiting the characteristics of operations. The proposed approach makes a joint consideration of batch normalization (BN) and rectified linear unit (ReLU) for channel pruning; it estimates how likely the two successive operations deactivate each feature map and prunes the channels with high probabilities. To this end, we learn differentiable masks for individual channels and make soft decisions throughout the optimization procedure, which facilitates to explore larger search space and train more stable networks. The proposed framework enables us to identify compressed models via a joint learning of model parameters and channel pruning without an extra procedure of fine-tuning. We perform extensive experiments and achieve outstanding performance in terms of the accuracy of output networks given the same amount of resources when compared with the state-of-the-art methods. |

PruneNet: Channel Pruning via Global Importance2020-05-22 ${\displaystyle \cong }$ |

Channel pruning is one of the predominant approaches for accelerating deep neural networks. Most existing pruning methods either train from scratch with a sparsity inducing term such as group lasso, or prune redundant channels in a pretrained network and then fine tune the network. Both strategies suffer from some limitations: the use of group lasso is computationally expensive, difficult to converge and often suffers from worse behavior due to the regularization bias. The methods that start with a pretrained network either prune channels uniformly across the layers or prune channels based on the basic statistics of the network parameters. These approaches either ignore the fact that some CNN layers are more redundant than others or fail to adequately identify the level of redundancy in different layers. In this work, we investigate a simple-yet-effective method for pruning channels based on a computationally light-weight yet effective data driven optimization step that discovers the necessary width per layer. Experiments conducted on ILSVRC-$12$ confirm effectiveness of our approach. With non-uniform pruning across the layers on ResNet-$50$, we are able to match the FLOP reduction of state-of-the-art channel pruning results while achieving a $0.98\%$ higher accuracy. Further, we show that our pruned ResNet-$50$ network outperforms ResNet-$34$ and ResNet-$18$ networks, and that our pruned ResNet-$101$ outperforms ResNet-$50$. |

Realistic Channel Models Pre-training2019-07-21 ${\displaystyle \cong }$ |

In this paper, we propose a neural-network-based realistic channel model with both the similar accuracy as deterministic channel models and uniformity as stochastic channel models. To facilitate this realistic channel modeling, a multi-domain channel embedding method combined with self-attention mechanism is proposed to extract channel features from multiple domains simultaneously. This 'one model to fit them all' solution employs available wireless channel data as the only data set for self-supervised pre-training. With the permission of users, network operators or other organizations can make use of some available user specific data to fine-tune this pre-trained realistic channel model for applications on channel-related downstream tasks. Moreover, even without fine-tuning, we show that the pre-trained realistic channel model itself is a great tool with its understanding of wireless channel. |

EZCrop: Energy-Zoned Channels for Robust Output Pruning2021-05-08 ${\displaystyle \cong }$ |

Recent results have revealed an interesting observation in a trained convolutional neural network (CNN), namely, the rank of a feature map channel matrix remains surprisingly constant despite the input images. This has led to an effective rank-based channel pruning algorithm, yet the constant rank phenomenon remains mysterious and unexplained. This work aims at demystifying and interpreting such rank behavior from a frequency-domain perspective, which as a bonus suggests an extremely efficient Fast Fourier Transform (FFT)-based metric for measuring channel importance without explicitly computing its rank. We achieve remarkable CNN channel pruning based on this analytically sound and computationally efficient metric and adopt it for repetitive pruning to demonstrate robustness via our scheme named Energy-Zoned Channels for Robust Output Pruning (EZCrop), which shows consistently better results than other state-of-the-art channel pruning methods. |

Channel Gating Neural Networks2019-10-28 ${\displaystyle \cong }$ |

This paper introduces channel gating, a dynamic, fine-grained, and hardware-efficient pruning scheme to reduce the computation cost for convolutional neural networks (CNNs). Channel gating identifies regions in the features that contribute less to the classification result, and skips the computation on a subset of the input channels for these ineffective regions. Unlike static network pruning, channel gating optimizes CNN inference at run-time by exploiting input-specific characteristics, which allows substantially reducing the compute cost with almost no accuracy loss. We experimentally show that applying channel gating in state-of-the-art networks achieves 2.7-8.0$\times$ reduction in floating-point operations (FLOPs) and 2.0-4.4$\times$ reduction in off-chip memory accesses with a minimal accuracy loss on CIFAR-10. Combining our method with knowledge distillation reduces the compute cost of ResNet-18 by 2.6$\times$ without accuracy drop on ImageNet. We further demonstrate that channel gating can be realized in hardware efficiently. Our approach exhibits sparsity patterns that are well-suited to dense systolic arrays with minimal additional hardware. We have designed an accelerator for channel gating networks, which can be implemented using either FPGAs or ASICs. Running a quantized ResNet-18 model for ImageNet, our accelerator achieves an encouraging speedup of 2.4$\times$ on average, with a theoretical FLOP reduction of 2.8$\times$. |

Dynamic Neural Network Channel Execution for Efficient Training2019-05-15 ${\displaystyle \cong }$ |

Existing methods for reducing the computational burden of neural networks at run-time, such as parameter pruning or dynamic computational path selection, focus solely on improving computational efficiency during inference. On the other hand, in this work, we propose a novel method which reduces the memory footprint and number of computing operations required for training and inference. Our framework efficiently integrates pruning as part of the training procedure by exploring and tracking the relative importance of convolutional channels. At each training step, we select only a subset of highly salient channels to execute according to the combinatorial upper confidence bound algorithm, and run a forward and backward pass only on these activated channels, hence learning their parameters. Consequently, we enable the efficient discovery of compact models. We validate our approach empirically on state-of-the-art CNNs - VGGNet, ResNet and DenseNet, and on several image classification datasets. Results demonstrate our framework for dynamic channel execution reduces computational cost up to 4x and parameter count up to 9x, thus reducing the memory and computational demands for discovering and training compact neural network models. |

C2S2: Cost-aware Channel Sparse Selection for Progressive Network Pruning2019-04-06 ${\displaystyle \cong }$ |

This paper describes a channel-selection approach for simplifying deep neural networks. Specifically, we propose a new type of generic network layer, called pruning layer, to seamlessly augment a given pre-trained model for compression. Each pruning layer, comprising $1 \times 1$ depth-wise kernels, is represented with a dual format: one is real-valued and the other is binary. The former enables a two-phase optimization process of network pruning to operate with an end-to-end differentiable network, and the latter yields the mask information for channel selection. Our method progressively performs the pruning task layer-wise, and achieves channel selection according to a sparsity criterion to favor pruning more channels. We also develop a cost-aware mechanism to prevent the compression from sacrificing the expected network performance. Our results for compressing several benchmark deep networks on image classification and semantic segmentation are comparable to those by state-of-the-art. |

Deep Model Compression via Deep Reinforcement Learning2019-12-04 ${\displaystyle \cong }$ |

Besides accuracy, the storage of convolutional neural networks (CNN) models is another important factor considering limited hardware resources in practical applications. For example, autonomous driving requires the design of accurate yet fast CNN for low latency in object detection and classification. To fulfill the need, we aim at obtaining CNN models with both high testing accuracy and small size/storage to address resource constraints in many embedded systems. In particular, this paper focuses on proposing a generic reinforcement learning based model compression approach in a two-stage compression pipeline: pruning and quantization. The first stage of compression, i.e., pruning, is achieved via exploiting deep reinforcement learning (DRL) to co-learn the accuracy of CNN models updated after layer-wise channel pruning on a testing dataset and the FLOPs, number of floating point operations in each layer, updated after kernel-wise variational pruning using information dropout. Layer-wise channel pruning is to remove unimportant kernels from the input channel dimension while kernel-wise variational pruning is to remove unimportant kernels from the 2D-kernel dimensions, namely, height and width. The second stage, i.e., quantization, is achieved via a similar DRL approach but focuses on obtaining the optimal weight bits for individual layers. We further conduct experimental results on CIFAR-10 and ImageNet datasets. For the CIFAR-10 dataset, the proposed method can reduce the size of VGGNet by 9x from 20.04MB to 2.2MB with 0.2% accuracy increase. For the ImageNet dataset, the proposed method can reduce the size of VGG-16 by 33x from 138MB to 4.14MB with no accuracy loss. |

DAIS: Automatic Channel Pruning via Differentiable Annealing Indicator Search2020-11-04 ${\displaystyle \cong }$ |

The convolutional neural network has achieved great success in fulfilling computer vision tasks despite large computation overhead against efficient deployment. Structured (channel) pruning is usually applied to reduce the model redundancy while preserving the network structure, such that the pruned network can be easily deployed in practice. However, existing structured pruning methods require hand-crafted rules which may lead to tremendous pruning space. In this paper, we introduce Differentiable Annealing Indicator Search (DAIS) that leverages the strength of neural architecture search in the channel pruning and automatically searches for the effective pruned model with given constraints on computation overhead. Specifically, DAIS relaxes the binarized channel indicators to be continuous and then jointly learns both indicators and model parameters via bi-level optimization. To bridge the non-negligible discrepancy between the continuous model and the target binarized model, DAIS proposes an annealing-based procedure to steer the indicator convergence towards binarized states. Moreover, DAIS designs various regularizations based on a priori structural knowledge to control the pruning sparsity and to improve model performance. Experimental results show that DAIS outperforms state-of-the-art pruning methods on CIFAR-10, CIFAR-100, and ImageNet. |

Lossless CNN Channel Pruning via Gradient Resetting and Convolutional Re-parameterization2020-07-07 ${\displaystyle \cong }$ |

Channel pruning (a.k.a. filter pruning) aims to slim down a convolutional neural network (CNN) by reducing the width (i.e., numbers of output channels) of convolutional layers. However, as CNN's representational capacity depends on the width, doing so tends to degrade the performance. A traditional learning-based channel pruning paradigm applies a penalty on parameters to improve the robustness to pruning, but such a penalty may degrade the performance even before pruning. Inspired by the neurobiology research about the independence of remembering and forgetting, we propose to re-parameterize a CNN into the remembering parts and forgetting parts, where the former learn to maintain the performance and the latter learn for efficiency. By training the re-parameterized model using regular SGD on the former but a novel update rule with penalty gradients on the latter, we achieve structured sparsity, enabling us to equivalently convert the re-parameterized model into the original architecture with narrower layers. With our method, we can slim down a standard ResNet-50 with 76.15\% top-1 accuracy on ImageNet to a narrower one with only 43.9\% FLOPs and no accuracy drop. Code and models are released at https://github.com/DingXiaoH/ResRep. |

Learning the Wireless V2I Channels Using Deep Neural Networks2019-07-10 ${\displaystyle \cong }$ |

For high data rate wireless communication systems, developing an efficient channel estimation approach is extremely vital for channel detection and signal recovery. With the trend of high-mobility wireless communications between vehicles and vehicles-to-infrastructure (V2I), V2I communications pose additional challenges to obtaining real-time channel measurements. Deep learning (DL) techniques, in this context, offer learning ability and optimization capability that can approximate many kinds of functions. In this paper, we develop a DL-based channel prediction method to estimate channel responses for V2I communications. We have demonstrated how fast neural networks can learn V2I channel properties and the changing trend. The network is trained with a series of channel responses and known pilots, which then speculates the next channel response based on the acquired knowledge. The predicted channel is then used to evaluate the system performance. |

Channel Estimation via Successive Denoising in MIMO OFDM Systems: A Reinforcement Learning Approach2021-01-25 ${\displaystyle \cong }$ |

Reliable communication through multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) requires accurate channel estimation. Existing literature largely focuses on denoising methods for channel estimation that are dependent on either (i) channel analysis in the time-domain, and/or (ii) supervised learning techniques, requiring large pre-labeled datasets for training. To address these limitations, we present a frequency-domain denoising method based on the application of a reinforcement learning framework that does not need a priori channel knowledge and pre-labeled data. Our methodology includes a new successive channel denoising process based on channel curvature computation, for which we obtain a channel curvature magnitude threshold to identify unreliable channel estimates. Based on this process, we formulate the denoising mechanism as a Markov decision process, where we define the actions through a geometry-based channel estimation update, and the reward function based on a policy that reduces the MSE. We then resort to Q-learning to update the channel estimates over the time instances. Numerical results verify that our denoising algorithm can successfully mitigate noise in channel estimates. In particular, our algorithm provides a significant improvement over the practical least squares (LS) channel estimation method and provides performance that approaches that of the ideal linear minimum mean square error (LMMSE) with perfect knowledge of channel statistics. |

Meta-Learning to Communicate: Fast End-to-End Training for Fading Channels2019-10-22 ${\displaystyle \cong }$ |

When a channel model is available, learning how to communicate on fading noisy channels can be formulated as the (unsupervised) training of an autoencoder consisting of the cascade of encoder, channel, and decoder. An important limitation of the approach is that training should be generally carried out from scratch for each new channel. To cope with this problem, prior works considered joint training over multiple channels with the aim of finding a single pair of encoder and decoder that works well on a class of channels. As a result, joint training ideally mimics the operation of non-coherent transmission schemes. In this paper, we propose to obviate the limitations of joint training via meta-learning: Rather than training a common model for all channels, meta-learning finds a common initialization vector that enables fast training on any channel. The approach is validated via numerical results, demonstrating significant training speed-ups, with effective encoders and decoders obtained with as little as one iteration of Stochastic Gradient Descent. |

Network Pruning via Annealing and Direct Sparsity Control2020-07-26 ${\displaystyle \cong }$ |

Artificial neural networks (ANNs) especially deep convolutional networks are very popular these days and have been proved to successfully offer quite reliable solutions to many vision problems. However, the use of deep neural networks is widely impeded by their intensive computational and memory cost. In this paper, we propose a novel efficient network pruning method that is suitable for both non-structured and structured channel-level pruning. Our proposed method tightens a sparsity constraint by gradually removing network parameters or filter channels based on a criterion and a schedule. The attractive fact that the network size keeps dropping throughout the iterations makes it suitable for the pruning of any untrained or pre-trained network. Because our method uses a $L_0$ constraint instead of the $L_1$ penalty, it does not introduce any bias in the training parameters or filter channels. Furthermore, the $L_0$ constraint makes it easy to directly specify the desired sparsity level during the network pruning process. Finally, experimental validation on extensive synthetic and real vision datasets show that the proposed method obtains better or competitive performance compared to other states of art network pruning methods. |

PCAS: Pruning Channels with Attention Statistics for Deep Network Compression2019-08-20 ${\displaystyle \cong }$ |

Compression techniques for deep neural networks are important for implementing them on small embedded devices. In particular, channel-pruning is a useful technique for realizing compact networks. However, many conventional methods require manual setting of compression ratios in each layer. It is difficult to analyze the relationships between all layers, especially for deeper models. To address these issues, we propose a simple channel-pruning technique based on attention statistics that enables to evaluate the importance of channels. We improved the method by means of a criterion for automatic channel selection, using a single compression ratio for the entire model in place of per-layer model analysis. The proposed approach achieved superior performance over conventional methods with respect to accuracy and the computational costs for various models and datasets. We provide analysis results for behavior of the proposed criterion on different datasets to demonstrate its favorable properties for channel pruning. |

Layer Pruning for Accelerating Very Deep Neural Networks2019-10-28 ${\displaystyle \cong }$ |

In this paper, we propose an adaptive pruning method. This method can cut off the channel and layer adaptively. The proportion of the layer and the channel to be cut is learned adaptively. The pruning method proposed in this paper can reduce half of the parameters, and the accuracy will not decrease or even be higher than baseline. |