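Workshop on Complex and High-Dimensional Data Analysis (複雜高維資料分析研討會)
Date: December 9 to December 10, 2015 (ROC year 104)
Venue: National Sun Yat-sen University, 華立廳, 1F, 國研大樓 (December 9) and Room 理SC 40091, 4F, College of Science (December 10)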
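This workshop is held to explore methods for analyzing massive data and to promote the sharing of theoretical and practical experience between researchers in Taiwan and abroad. We expect to invite 15 experts and scholars from the United States, Hong Kong, Singapore, and Taiwan, who will report on and discuss their academic research or the results of massive-data analyses arising from real problems. We hope the workshop gives scholars interested in this field an opportunity to exchange research ideas and a bridge for communication, fosters interdisciplinary collaboration, and helps stimulate students' interest in entering the field. The event will be run as a workshop: statisticians and information engineering experts from Taiwan and abroad who study high-dimensional complex data will give detailed reports on research and developments in this area, and scholars and experts in Taiwan who are interested in the topic are invited to join the discussion.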
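Organizing institution: Multidisciplinary and Data Science Research Center, National Sun Yat-sen University
Co-organizers: Department of Applied Mathematics, National Sun Yat-sen University; National Center for Theoretical Sciences, Mathematics Division
Organizers: 郭美惠 (Meihui Guo), 黃杰森
Program committee: 銀慶剛 (Ching-Kang Ing), 黃信誠, 李育杰 (Yuh-Jye Lee)
Online registration: http://hp1.math.nsysu.edu.tw/conference/wchdda2015/
Workshop photos: http://www.math.nsysu.edu.tw/conference/wchdda2015/album/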
Agenda

December 9 (Wed)
9:10~9:30    Registration
9:30~9:40    Opening ceremony
9:40~10:40   Chair: Prof. 郭美惠 (Meihui Guo)
10:40~10:50  Coffee break
10:50~12:20
12:30~13:00  Plaque unveiling ceremony and group photo
13:00~13:50  Lunch
13:50~14:50  Chair: Prof. 杜憶萍 (I-Ping Tu)
14:50~15:50
15:50~16:00  Coffee break
16:00~17:00  Chair: Prof. 銀慶剛 (Ching-Kang Ing)
17:00~18:00
18:00~20:30  Banquet (invitation only)

December 10 (Thu)
9:10~10:10   Chair: Prof. 李育杰 (Yuh-Jye Lee)
10:10~10:20  Coffee break
10:20~12:30
12:30~13:30  Lunch

Speaker: Ruey S. Tsay (蔡瑞胸)
Booth School of Business, University of Chicago, USA
Title: New Frontier and its Challenges
Abstract:
Most studies in big data focus on independent observations. Real-world big data, on the other hand, often have dynamic dependence. Hourly readings of air pollutants are collected widely at various monitoring stations. These data can be big, and they are dynamically dependent. In finance, transaction-by-transaction data of many stocks are available in most exchanges. These data are also big and dynamically dependent. Do the methods developed for independent big data continue to apply in the presence of dynamic dependence? In this talk, we discuss issues facing the analysis of dependent big data. We explore the applicability of some statistical methods developed for independent big data and discuss new challenges when data are dynamically dependent. We also consider some simple methods in time series analysis that can be extended to handle dependent big data. Both real and simulated examples are used to demonstrate the key concepts and applications.
Speaker: Chun Yip Yau (邱俊業)
Department of Statistics, Chinese University of Hong Kong, Hong Kong
Title: LARS-type algorithm for group lasso
Abstract:
The least absolute shrinkage and selection operator (lasso) has been widely used in regression analysis. Based on the piecewise linear property of the solution path, least angle regression (LARS) provides an efficient algorithm for computing the solution paths of lasso. Group lasso is an important generalization of lasso that can be applied to regression with grouped variables. However, the solution path of group lasso is not piecewise linear and hence cannot be obtained by least angle regression. By transforming the problem into a system of differential equations, we develop an algorithm for efficient computation of group lasso solution paths.
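To make the difference between lasso and group lasso concrete, the sketch below implements block soft-thresholding, the standard proximal operator of the group penalty. It is not the differential-equation path algorithm described in the talk; the function name `group_soft_threshold` and the parameter `lam` are illustrative assumptions. The shrinkage factor depends on the reciprocal of the group norm, which is a nonlinear function of the coefficients, in contrast to the constant shift used by ordinary (coordinate-wise) soft-thresholding.

```python
import math


def group_soft_threshold(v, lam):
    """Block soft-thresholding: shrink a whole coefficient group toward zero.

    v   : list of floats, the coefficients of one group (hypothetical input)
    lam : non-negative penalty level (hypothetical name)
    """
    norm = math.sqrt(sum(x * x for x in v))
    # The whole group is set to zero when its Euclidean norm is at most lam,
    # which is how the group penalty selects or drops variables group by group.
    if norm <= lam:
        return [0.0] * len(v)
    # Otherwise every coefficient in the group is scaled by the same factor.
    scale = 1.0 - lam / norm
    return [scale * x for x in v]


print(group_soft_threshold([3.0, 4.0], 5.0))  # [0.0, 0.0]
print(group_soft_threshold([3.0, 4.0], 2.5))  # [1.5, 2.0]
```

With a group of size one, the operator reduces to ordinary soft-thresholding, so lasso appears as a special case.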
Speaker: Liang-Ching Lin (林良靖)
Department of Statistics, National Cheng Kung University
Title: Robust principal expectile component analysis
Abstract:
Principal component analysis (PCA) is a widely used dimension reduction technique, especially for high-dimensional data analysis. The principal components are identified by sequentially maximizing the component score variance for observations centered on the sample mean. In practice, however, one might be more interested in the variation captured by the tail characteristics than in variation around the sample mean, for example in the analysis of expected shortfall. To properly capture the tail characteristics, principal expectile component (PEC) analysis was proposed based on an asymmetric L2 norm (Tran, Osipenko, and Härdle, 2014). To achieve robustness against outliers, we generalize PEC by integrating it with Huber's norm. The newly proposed method is named principal Huber-type expectile components (PHEC). A derivative-free optimization approach, particle swarm optimization (PSO), is adopted to efficiently identify the components in PHEC. Simulation studies show that PHEC outperforms PCA and PEC in capturing the tail variation in the case of normal mixture distributions. Finally, a real example is analyzed for illustration.
(Joint work with Ray-Bing Chen, Mong-Na Lo Huang, and Meihui Guo)
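As background for the asymmetric L2 norm mentioned in the abstract, the following minimal sketch computes a univariate sample expectile by iteratively reweighted averaging. It does not implement PEC or PHEC, and it uses neither Huber's norm nor PSO; the function name `expectile` and its parameters are assumptions made for illustration. The tau-expectile minimizes the sum of w_i (x_i - e)^2, where w_i = tau for observations above e and 1 - tau otherwise, so tau = 0.5 recovers the sample mean and tau close to 1 moves toward the upper tail.

```python
def expectile(xs, tau, iters=100, tol=1e-12):
    """Sample tau-expectile of xs, for 0 < tau < 1 (illustrative helper)."""
    e = sum(xs) / len(xs)  # start from the sample mean
    for _ in range(iters):
        # Asymmetric weights: tau above the current value, 1 - tau otherwise.
        weights = [tau if x > e else 1.0 - tau for x in xs]
        # The minimizer of the weighted squared loss is the weighted mean.
        new_e = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_e - e) < tol:
            return new_e
        e = new_e
    return e


print(expectile([1.0, 2.0, 3.0, 4.0], 0.5))  # 2.5, the sample mean
```

Raising `tau` above 0.5 puts more weight on observations above the current value, which is the sense in which expectile-based components focus on tail behavior.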
Speaker: Shih-Feng Huang (黃士峰)
Institute of Statistics, National University of Kaohsiung
Title: A High-dimensional Location-Dispersion Model with Applications to Root Cause Detection for Wafer Fabrication Processes
Abstract:
We consider the problem of selecting a high-dimensional location-dispersion model. The orthogonal greedy algorithm (OGA), in conjunction with the high-dimensional information criterion (HDIC) and TRIM used by Ing and Lai (2011) in high-dimensional homogeneity models, is generalized to accommodate high-dimensional dispersion components. We prove selection consistency and derive the limiting distributions of the estimated parameters. These results are then applied to root cause detection for wafer fabrication processes, in which a problematic tool can lead not only to a location shift but also to an increase in variance. In particular, based on the variables selected by OGA+HDIC+TRIM, a novel and easy-to-implement procedure is proposed to identify problematic tools. Moreover, since parameter estimation involves solving nonlinear optimization problems with many variables, we provide two iterative algorithms to alleviate this difficulty. Real data analysis shows that the proposed method performs quite satisfactorily.
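The greedy selection step behind OGA can be sketched as follows for an ordinary linear model. This is a simplified illustration only: it covers the forward selection stage with a fixed number of steps, and omits HDIC, TRIM, and the dispersion component discussed in the abstract. The function name `oga` and its arguments are assumptions. At each step the predictor most correlated with the current residual is chosen, and the residual is then projected onto the orthogonal complement of all selected predictors using Gram-Schmidt.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def oga(columns, y, steps):
    """Greedy forward selection; columns is a list of predictor columns."""
    residual = list(y)
    basis = []      # orthonormalized copies of the selected columns
    selected = []   # indices of selected columns, in order of entry
    for _ in range(steps):
        best_j, best_score = None, 0.0
        for j, col in enumerate(columns):
            norm = math.sqrt(dot(col, col))
            if j in selected or norm == 0.0:
                continue
            # Absolute correlation-type score with the current residual.
            score = abs(dot(col, residual)) / norm
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break  # no remaining column explains the residual
        # Orthogonalize the chosen column against earlier selections.
        q = list(columns[best_j])
        for b in basis:
            c = dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        qnorm = math.sqrt(dot(q, q))
        if qnorm < 1e-12:
            break  # chosen column is numerically in the span already
        q = [qi / qnorm for qi in q]
        # Remove the newly explained part from the residual.
        c = dot(residual, q)
        residual = [r - c * qi for r, qi in zip(residual, q)]
        basis.append(q)
        selected.append(best_j)
    return selected, residual


cols = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
order, res = oga(cols, [0.0, 3.0, 1.0, 0.0], steps=2)
print(order)  # [1, 2]
```

In a full procedure, an information criterion would decide where along this selection path to stop, rather than a fixed `steps` value.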
Speaker: Ker-Chau Li (李克昭)
Institute of Statistical Science, Academia Sinica
Title: Perspectives of machine learning and multivariate statistical analysis for deep data analytics
Abstract:
As fundamental training components in the curriculum of the emerging field of data science, machine learning (ML) in computer science and multivariate analysis (MA) in statistics overlap substantially. In this talk, I will discuss how to exploit the commonality and uniqueness of MA and ML for deep data analytics, which requires thoughtful planning of layer upon layer of analyses.
Speaker: I-Ping Tu (杜憶萍)
Institute of Statistical Science, Academia Sinica
Title: One Dimension Reduction Method for Tensor Structure Data and its Application on cryo-EM Image Analysis
Abstract:
Dimension reduction is a key step in the statistical analysis of high-dimensional data. When each observation is a matrix or a higher-order tensor, the traditional approach is to vectorize the data before executing reduction algorithms. This approach often leads to an extremely high-dimensional problem, which comes with intensive computation and inefficient estimation. Higher-order SVD and multilinear principal component analysis (MPCA) were thus proposed for tensor-structured data. They reduce each mode space of the tensor separately and thus reduce the computation significantly. One criticism of this approach is that, unlike PCA, the projected data in the reduced space are still correlated. To this end, we propose a two-stage dimension reduction method, called structure PCA (SPCA). SPCA first employs MPCA on the tensor data, and then applies PCA to the vectorized projected core scores from MPCA. A successful application of SPCA to cryo-electron microscopy image data will also be presented.
Speaker: Hua Tang (唐桦)
Department of Genetics, Stanford University, USA
Title: A high dimensional graphical model for count data
Abstract:
With next-generation sequencing technologies generating large-scale RNA transcriptomic data, there is great interest in constructing complex networks that provide new insights into the coordination of gene regulation. A number of statistical methods have been developed for constructing networks based on a Gaussian assumption that may not be appropriate for non-Gaussian data such as RNA-seq data. In this study, we propose a novel statistical approach that models the observed counts using a Poisson log-normal distribution. This approach is based on maximizing a penalized likelihood. To overcome the computational challenge, we use Laplace integration to approximate the likelihood and its gradients, and apply the alternating direction method of multipliers to find maximum likelihood estimates. The performance of the proposed method is illustrated and compared with Gaussian models, using both simulated and real RNA-seq data. The proposed method shows improved performance over the Gaussian models in detecting edges that represent co-varying pairs of genes, particularly for low-abundance genes.
Speaker: Wei Biao Wu (吳偉標)
Department of Statistics, University of Chicago, USA
Title: Hypothesis Testing for High-Dimensional Data
Abstract:
We present a systematic theory for the problem of testing means of high-dimensional data. Our testing procedure is based on an invariance principle which provides distributional approximations of functionals of non-Gaussian vectors by those of Gaussian ones. Unlike the widely used Bonferroni approach, our procedure is dependence-adjusted and has asymptotically correct size and power. To obtain cutoff values for our test, we propose a half-sampling method which avoids estimating the underlying covariance matrix of the random vectors. The latter method is shown via extensive simulations to perform excellently.
Speaker: Ching-Kang Ing (銀慶剛)
Institute of Statistical Science, Academia Sinica
Title: Model Selection for High-Dimensional Misspecified Time Series
Abstract:
In this talk, I will address the challenging problem of choosing models for high-dimensional misspecified time series. I will first develop rates of convergence of the orthogonal greedy algorithm (OGA) under model misspecification. I will then apply this result to establish the sure screening property of OGA and the oracle property of the high-dimensional information criterion (HDIC). Finally, I will briefly touch upon some applications to high-dimensional interaction models.
Speaker: Guangming Pan (潘光明)
Division of Mathematical Sciences, Nanyang Technological University, Singapore
Title: Universality of the largest eigenvalues of F-type matrices and CCA
Abstract:
This talk is about the asymptotic distribution of the largest eigenvalues of F-type matrices and of canonical correlation analysis (CCA) when the sample size and the dimensions both go to infinity with their ratio being a positive constant. It is proved that the Tracy-Widom law holds for the largest eigenvalues of these matrices under some moment assumptions.
Speaker: Horng-Shing Lu (盧鴻興)
Institute of Statistics, National Chiao Tung University
Title: Bridging density functional theory and big data analytics with applications
Abstract:
The framework of density functional theory (DFT) shows both strong suitability and compatibility for investigating large-scale systems in the Big Data regime. By technically mapping the data space into physically meaningful bases, the article provides a simple procedure for formulating global Lagrangian and Hamiltonian density functionals to circumvent the emerging challenges of large-scale data analysis. The informative features of mixed datasets and the corresponding clustering morphologies can then be visually elucidated by evaluating the global density functionals. Simulation results for data clustering illustrate that the proposed methodology provides an alternative route for analyzing data characteristics with abundant physical insights. For a comprehensive demonstration on a high-dimensional problem without prior ground truth, the developed density functionals are also applied to the post-processing of magnetic resonance imaging (MRI), and better tumor recognition can be achieved on the T1 post-contrast and T2 modes. It is appealing that post-processing MRI with the proposed DFT-based algorithm could help scientists in the judgment of clinical pathology and in applications of high-dimensional biomedical image processing. Finally, successful high-dimensional data analyses suggest that the proposed DFT-based algorithm has the potential to be used as a framework for investigating large-scale complex systems.
Speaker: Yuh-Jye Lee (李育杰)
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology
Title: Multiclass Classification via Kernel Sliced Inverse Regression
Abstract:
Kernel sliced inverse regression (KSIR) is a nonlinear supervised dimension reduction method that aims to extract a low-dimensional effective dimension reduction (e.d.r.) subspace containing as much information about the output variable y as possible. It helps us visualize the structure of the data that encodes the output variable information. We apply KSIR to multiclass classification problems with more than three classes. The distinguishable classes will lie on the coordinate axes. We remove the top three distinguishable classes from the dataset, then apply KSIR to the remaining data points, and repeat these two steps iteratively. Finally, we obtain a classification tree that decomposes the original classification problem into a series of subproblems. In contrast to the one-vs.-rest and one-vs.-one schemes, the conventional ways to decompose a multiclass classification problem into a series of binary classification problems, our proposed method utilizes the within-class information extracted via KSIR.
Speaker: Y.C. Frank Wang (王鈺強)
Research Center for IT Innovation, Academia Sinica
Title: Domain Adaptation: Bridging Cross-Domain Data and Beyond
Abstract:
Domain adaptation aims to associate data and address learning tasks observed in different domains. For example, one could observe facial images of the subjects of interest in advance, while their sketch images are the ones to be recognized. Another example is that one needs to recognize different object images taken by mobile phones, while the training images are collected over the Internet (with different resolutions, viewpoints, etc.). Since training and test data are collected in distinct domains, the corresponding data distributions can be very different. As a result, traditional machine learning algorithms cannot be directly applied. In this talk, I will cover several recently developed learning algorithms for handling cross-domain data in a variety of settings, and talk about a number of applications in visual content recognition and synthesis.
Speaker: Jie Peng (彭捷)
Department of Statistics, University of California, Davis, USA
Title: Diffusion MRI: direction estimation and fiber tracking
Abstract:
Diffusion MRI is a magnetic resonance imaging technology which uses water diffusion as a proxy to probe the anatomy of biological tissues in an in vivo and noninvasive way. Diffusion MRI has been widely used to reconstruct white matter fiber tracts and to provide information on the structural connectivity of the brain. We will start with the diffusion tensor models, where water diffusion is modeled by Gaussian processes and characterized by tensors (covariance matrices). However, tensor models have difficulty resolving multiple crossing fibers within a voxel. We circumvent this problem by conducting direct diffusion direction estimation and smoothing. We then discuss a fiber tracking algorithm which uses the estimated directions as input and is applicable to crossing fiber regions.
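For readers unfamiliar with the single-tensor model that the talk starts from, the sketch below extracts the dominant diffusion direction of one voxel as the leading eigenvector of its 3 x 3 diffusion tensor, using plain power iteration. It illustrates only the baseline tensor model, not the direct direction estimation, smoothing, or fiber tracking methods of the talk. The function name `principal_direction` and the example tensor are assumptions; the iteration also assumes a unique largest eigenvalue and a starting vector that is not orthogonal to the leading eigenvector.

```python
import math


def principal_direction(D, iters=200):
    """Leading eigenvector of a symmetric positive semidefinite 3x3 matrix D."""
    v = [1.0, 1.0, 1.0]  # arbitrary starting vector (assumption)
    for _ in range(iters):
        # Multiply by the tensor, then renormalize to unit length.
        w = [sum(D[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("tensor maps the current vector to zero")
        v = [x / norm for x in w]
    return v


# Hypothetical tensor with strongest diffusion along the first axis.
tensor = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
direction = principal_direction(tensor)
```

A single tensor yields only one dominant direction per voxel, which is why the abstract notes that tensor models struggle where several fibers cross.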
Speaker: Hung-yi Lee (李弘毅)
Department of Electrical Engineering, National Taiwan University
Title: Deep Learning and its application in speech recognition
Abstract:
By establishing new state-of-the-art performance in speech recognition, image recognition, and some natural language processing tasks, deep learning techniques have achieved tremendous success in recent years. The focus of this talk is on deep learning approaches to speech recognition. I will first give a quick overview of the latest deep learning technology, which makes today's deep learning techniques very different from their 1980s ancestors. Then I will show how to apply deep learning to speech recognition and explain why it performs so well. Finally, the possible next waves of deep learning technology in speech recognition will be discussed, for example, end-to-end speech recognition.