Workshop on Complex and High-Dimensional Data Analysis (複雜高維資料分析研討會)

Dates: December 9 to December 10, 2015

Venue: National Sun Yat-sen University: 華立廳, 1F, 國研大樓 (December 9) and Room SC 4009-1, 4F, College of Science (December 10)

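This workshop is held to explore methods for analyzing massive data and to promote the sharing of theoretical and practical experience between researchers in Taiwan and abroad. We plan to invite 15 experts and scholars from the United States, Hong Kong, Singapore, and Taiwan, who will report on their academic research or on results from analyzing massive data arising in real problems, followed by discussion. We hope the workshop gives scholars interested in this field an opportunity to exchange research findings and a bridge for communication, fosters interdisciplinary collaboration, and stimulates students' interest in entering the field. The event will run in workshop format: statisticians and information engineering experts from Taiwan and abroad who study high-dimensional complex data will give detailed reports on research and developments in this area, and scholars and experts in Taiwan who are interested in the topic are invited to join the discussion.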

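Organizer: Multidisciplinary and Data Science Research Center, National Sun Yat-sen University
Co-organizers: Department of Applied Mathematics, National Sun Yat-sen University; National Center for Theoretical Sciences, Mathematics Division
Conveners: 郭美惠 (Meihui Guo), 黃杰森
Program committee: 銀慶剛 (Ching-Kang Ing), 黃信誠, 李育杰 (Yuh-Jye Lee)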

Online registration: http://hp1.math.nsysu.edu.tw/conference/wchdda2015/

Workshop photos: http://www.math.nsysu.edu.tw/conference/wchdda2015/album/

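Program

Date and venue	December 9 (Wed), 華立廳, 1F, 國研大樓	Date and venue	December 10 (Thu), Room SC 4009-1, 4F, College of Science
9:10~9:30	Registration
9:30~9:40	Opening ceremony: address by the University President
9:40~10:40	Chair: Prof. 郭美惠 (Meihui Guo). Speaker: Academician 蔡瑞胸 (Ruey S. Tsay)	9:10~10:10	Chair: Prof. 李育杰 (Yuh-Jye Lee). Speaker: Prof. 盧鴻興 (Horng-Shing Lu)
10:40~10:50	Coffee break	10:10~10:20	Coffee break
10:50~12:20	Chair: Prof. 黃郁芬. Speakers: Prof. 邱俊業 (Chun Yip Yau), Prof. 林良靖 (Liang-Ching Lin), Prof. 黃士峰 (Shih-Feng Huang)	10:20~12:30	Chair: Prof. 俞淑惠. Speakers: Prof. 李育杰 (Yuh-Jye Lee), Prof. 王鈺強 (Y.-C. Frank Wang), Prof. 彭捷 (Jie Peng), Prof. 李宏毅 (Hung-yi Lee)
12:30~13:00	Plaque unveiling ceremony and group photo (location: College of Science courtyard)	12:30~13:30	Lunch
13:00~13:50	Lunch
13:50~14:50	Chair: Prof. 杜憶萍 (I Ping Tu). Speaker: Academician 李克昭 (Ker Chau Li)
14:50~15:50	Chair: Prof. 張中. Speakers: Prof. 杜憶萍 (I Ping Tu), Prof. 唐桦 (Hua Tang)
15:50~16:00	Coffee break
16:00~17:00	Chair: Prof. 銀慶剛 (Ching-Kang Ing). Speaker: Prof. 吳偉標 (Wei Biao Wu)
17:00~18:00	Chair: Prof. 樊采虹. Speakers: Prof. 銀慶剛 (Ching-Kang Ing), Prof. 潘光明 (Guangming Pan)
18:00~20:30	Banquet (by invitation only)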

Speaker: 蔡瑞胸 (Ruey S. Tsay)
Booth School of Business, University of Chicago, USA

Title: New Frontier and its Challenges
Abstract:
Most studies in big data focus on independent observations. On the other hand, real-world big data often have dynamic dependence. Hourly readings of air pollutants are collected widely at various monitoring stations. These data can be big and they are dynamically dependent. In finance, transaction-by-transaction data of many stocks are available in most exchanges. These data are also big and dynamically dependent. Do the methods developed for independent big data continue to apply in the presence of dynamic dependence? In this talk, we discuss issues facing the analysis of dependent big data. We explore the applicability of some statistical methods developed for independent big data and discuss new challenges when data are dynamically dependent. We also consider some simple methods in time series analysis that can be extended to handle dependent big data. Both real and simulated examples are used to demonstrate the key concepts and applications.

Speaker: 邱俊業 (YAU, Chun Yip)
Department of Statistics, Chinese University of Hong Kong, Hong Kong

Title: LARS-type algorithm for group lasso
Abstract:
The least absolute shrinkage and selection operator (lasso) has been widely used in regression analysis. Based on the piecewise linear property of the solution path, least angle regression (LARS) provides an efficient algorithm for computing the solution paths of lasso. Group lasso is an important generalization of lasso that can be applied to regression with grouped variables. However, the solution path of group lasso is not piecewise linear and hence cannot be obtained by least angle regression. By transforming the problem into a system of differential equations, we develop an algorithm for efficient computation of group lasso solution paths.
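As a small illustration of why grouped variables behave differently from the plain lasso, the sketch below implements the standard block soft-thresholding operator, which is the proximal operator of the group lasso penalty lam * ||v||_2 for a single group. This is not the differential-equation path algorithm described in the talk; the function name `group_soft_threshold` and its arguments are hypothetical and chosen only for this example.

```python
import math


def group_soft_threshold(v, lam):
    """Block soft-thresholding: proximal operator of lam * ||v||_2.

    v   : list of floats, the coefficients of one group (illustrative input)
    lam : non-negative penalty level (illustrative input)
    The whole group is set to zero when its Euclidean norm is at most lam;
    otherwise every coefficient is shrunk by the same factor.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = 1.0 - lam / norm
    return [scale * x for x in v]


if __name__ == "__main__":
    # A group with small norm is removed as a block.
    print(group_soft_threshold([3.0, 4.0], 10.0))
    # A group with large norm is shrunk toward zero but kept.
    print(group_soft_threshold([3.0, 4.0], 2.5))
```

The shrinkage factor depends on the norm of the whole group, so coefficients in the same group enter or leave the model together, which is the behavior the group lasso is designed to give.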

Speaker: 林良靖 (Liang-Ching Lin)
Department of Statistics, National Cheng Kung University

Title: Robust principal expectile component analysis
Abstract:
Principal component analysis (PCA) is a widely used dimension reduction technique, especially for high-dimensional data analysis. The principal components are identified by sequentially maximizing the component score variance for observations centered on the sample mean. However, in practice, one might be more interested in the variation captured by the tail characteristics instead of the sample mean, for example in the analysis of expected shortfall. To properly capture the tail characteristics, principal expectile component (PEC) analysis was proposed based on an asymmetric L2 norm (Tran, Osipenko, and Härdle, 2014). To achieve robustness against outliers, we generalize the PEC by integrating it with Huber's norm. The newly proposed method is named principal Huber-type expectile components (PHEC). A derivative-free optimization approach, particle swarm optimization (PSO), is adopted to efficiently identify the components in PHEC. Simulation studies show that PHEC outperforms PCA and PEC in capturing the tail variation in the case of normal mixture distributions. Finally, a real example is analyzed for illustration.
(Joint work with Ray Bing Chen, Mong-Na Lo Huang and Meihui Guo)
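To make the asymmetric L2 norm concrete, here is a minimal sketch of a univariate sample expectile, computed by iteratively reweighted averaging. It only illustrates the loss that underlies expectiles; it is not the PEC or PHEC procedure from the talk, and the names `asymmetric_l2_loss` and `expectile` are hypothetical.

```python
def asymmetric_l2_loss(ys, e, tau):
    """Asymmetric squared loss: weight tau above e, weight (1 - tau) at or below e."""
    return sum((tau if y > e else 1.0 - tau) * (y - e) ** 2 for y in ys)


def expectile(ys, tau, tol=1e-12, max_iter=1000):
    """Sample tau-expectile (0 < tau < 1) via iteratively reweighted means.

    The minimizer of the asymmetric squared loss is a weighted mean whose
    weights depend on which side of the current estimate each point lies.
    """
    e = sum(ys) / len(ys)  # start from the ordinary mean (the 0.5-expectile)
    for _ in range(max_iter):
        ws = [tau if y > e else 1.0 - tau for y in ys]
        new_e = sum(w * y for w, y in zip(ws, ys)) / sum(ws)
        if abs(new_e - e) < tol:
            return new_e
        e = new_e
    return e


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(expectile(data, 0.5))  # coincides with the sample mean
    print(expectile(data, 0.9))  # pulled toward the upper tail
```

With tau = 0.5 the loss is symmetric and the expectile is the mean; moving tau toward 1 or 0 shifts the center toward the upper or lower tail, which is what lets expectile-based components focus on tail variation.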

Speaker: 黃士峰 (Shih-Feng Huang)
Institute of Statistics, National University of Kaohsiung

Title: A High-dimensional Location-Dispersion Model with Applications to Root Cause Detection for Wafer Fabrication Processes
Abstract:
We consider the problem of selecting a high-dimensional location-dispersion model. The orthogonal greedy algorithm (OGA) in conjunction with the high-dimensional information criterion (HDIC) and TRIM used by Ing and Lai (2011) in high-dimensional homogeneity models is generalized to accommodate high-dimensional dispersion components. We prove selection consistency and derive the limiting distributions of the estimated parameters. These results are then applied to root cause detection for wafer fabrication processes, in which a problematic tool can lead not only to a location shift, but also to an increase in variance. In particular, based on the variables selected by OGA+HDIC+TRIM, a novel and easy-to-implement procedure is proposed to identify problematic tools. Moreover, since parameter estimation involves solving nonlinear optimization problems with many variables, we provide two iterative algorithms to alleviate this difficulty. Real data analysis shows that the proposed method performs quite satisfactorily.

Speaker: 李克昭 (Ker Chau Li)
Institute of Statistical Science, Academia Sinica

Title: Perspectives of machine learning and multivariate statistical analysis for deep data analytics
Abstract:
As fundamental training components in the curriculum of the emerging field of data science, machine learning (ML) in computer science and multivariate analysis (MA) in statistics overlap substantially. In this talk, I will discuss how to exploit the commonality and uniqueness of MA and ML for deep data analytics, which requires thoughtful planning of layer upon layer of analyses.

Speaker: 杜憶萍 (I Ping Tu)
Institute of Statistical Science, Academia Sinica

Title: One Dimension Reduction Method for Tensor Structure Data and its Application on cryo-EM Image Analysis
Abstract:
Dimension reduction is one key step in statistical analysis for high-dimensional data. When each observation is a matrix or a higher-order tensor, the traditional approach is to vectorize the data before executing reduction algorithms. This approach often leads to an extremely high-dimensional problem, which comes with intensive computation and inefficient estimation. High-order SVD and multilinear principal component analysis (MPCA) were thus proposed for tensor structure data. They reduce each mode space of the tensor separately and thus reduce the computation significantly. One criticism of this approach is that, unlike PCA, the projected data in the reduced space are still correlated. To address this, we propose a two-stage dimension reduction method, called structure PCA (SPCA). SPCA first employs MPCA on the tensor data, and then applies PCA to the vectorized projected core scores from MPCA. A successful application of SPCA to cryo-electron microscopy image data will also be presented.

Speaker: 唐桦 (Hua Tang)
Department of Genetics, Stanford University, USA

Title: A high dimensional graphical model for count data
Abstract:
With next-generation sequencing technologies generating large-scale RNA transcriptomic data, there is great interest in constructing complex networks that provide new insights into the coordination of gene regulation. A number of statistical methods have been developed for constructing networks based on a Gaussian assumption, which may not be appropriate for non-Gaussian data such as RNA-seq data. In this study, we propose a novel statistical approach that models the observed counts using a Poisson lognormal distribution. This approach is based on maximizing a penalized likelihood. To overcome the computational challenge, we use Laplace integration to approximate the likelihood and its gradients, and apply the alternating direction method of multipliers to find maximum likelihood estimates. The performance of the proposed method is illustrated and compared with Gaussian models, using both simulated and real RNA-seq data. The proposed method shows improved performance over the Gaussian models in detecting edges that represent co-varying pairs of genes, particularly for low-abundance genes.

Speaker: 吳偉標 (Wei Biao Wu)
Department of Statistics, University of Chicago, USA

Title: Hypothesis Testing for High-Dimensional Data
Abstract:
We present a systematic theory for the problem of testing means of high-dimensional data. Our testing procedure is based on an invariance principle which provides distributional approximations of functionals of non-Gaussian vectors by those of Gaussian ones. Unlike the widely used Bonferroni approach, our procedure is dependence-adjusted and has asymptotically correct size and power. To obtain cutoff values for our test, we propose a half-sampling method which avoids estimating the underlying covariance matrix of the random vectors. This method is shown via extensive simulations to perform excellently.
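For context, the Bonferroni baseline mentioned in the abstract can be sketched in a few lines. The example assumes each coordinate already has a standardized z-statistic with an approximately standard normal null distribution; that assumption, and the names `bonferroni_cutoff` and `bonferroni_reject`, are introduced here for illustration only. The half-sampling procedure of the talk is not reproduced.

```python
from statistics import NormalDist


def bonferroni_cutoff(p, alpha=0.05):
    """Two-sided critical value for p simultaneous z-tests at overall level alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))


def bonferroni_reject(z_stats, alpha=0.05):
    """Reject the global null (all means zero) if any |z_j| exceeds the cutoff.

    z_stats: list of standardized test statistics, one per coordinate
             (assumed approximately N(0, 1) under the null).
    """
    cutoff = bonferroni_cutoff(len(z_stats), alpha)
    return max(abs(z) for z in z_stats) > cutoff


if __name__ == "__main__":
    print(bonferroni_cutoff(1))     # ordinary two-sided 5% cutoff
    print(bonferroni_cutoff(1000))  # cutoff grows with the dimension
    print(bonferroni_reject([0.1, -0.2, 5.0]))
```

The cutoff depends only on the number of coordinates and ignores how they are correlated, which is why the Bonferroni test can be conservative under strong dependence and why a dependence-adjusted cutoff is of interest.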

Speaker: 銀慶剛 (Ching-Kang Ing)
Institute of Statistical Science, Academia Sinica

Title: Model Selection for High-Dimensional Misspecified Time Series
Abstract:
In this talk, I will address the challenging problem of choosing models for high-dimensional misspecified time series. I will first develop rates of convergence of the orthogonal greedy algorithm (OGA) under model misspecification. I will then apply this result to establish the sure screening property of OGA and the oracle property of the high-dimensional information criterion (HDIC). Finally, I will briefly touch upon some applications to high-dimensional interaction models.

Speaker: 潘光明 (Guangming Pan)
Division of Mathematical Sciences, Nanyang Technological University, Singapore

Title: Universality of the largest eigenvalues of F-type matrices and CCA
Abstract:
This talk is about the asymptotic distribution of the largest eigenvalues of F-type matrices and of canonical correlation analysis (CCA) matrices when the sample size and dimensions both go to infinity with their ratio tending to a positive constant. It is proved that the Tracy-Widom law holds for the largest eigenvalues of these matrices under some moment assumptions.

Speaker: 盧鴻興 (Horng-Shing Lu)
Institute of Statistics, National Chiao Tung University

Title: Bridging density functional theory and big data analytics with applications
Abstract:
The framework of density functional theory (DFT) shows both strong suitability and compatibility for investigating large-scale systems in the Big Data regime. By technically mapping the data space into physically meaningful bases, the article provides a simple procedure to formulate global Lagrangian and Hamiltonian density functionals to circumvent the emerging challenges of large-scale data analysis. The informative features of mixed datasets and the corresponding clustering morphologies can then be visually elucidated by evaluating the global density functionals. Simulation results of data clustering illustrate that the proposed methodology provides an alternative route for analyzing data characteristics with abundant physical insights. For a comprehensive demonstration on a high-dimensional problem without prior ground truth, the developed density functionals are also applied to the post-processing of magnetic resonance imaging (MRI), and better tumor recognition can be achieved on the T1 post-contrast and T2 modes. Post-processing MRI with the proposed DFT-based algorithm may help scientists in the judgment of clinical pathology and in applications of high-dimensional biomedical image processing. Finally, successful high-dimensional data analyses suggest that the proposed DFT-based algorithm has the potential to be used as a framework for investigating large-scale complex systems.

Speaker: 李育杰 (Yuh-Jye Lee)
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Title: Multi-class Classification via Kernel Sliced Inverse Regression
Abstract:
Kernel sliced inverse regression (KSIR) is a nonlinear supervised dimension reduction method that aims to extract a low-dimensional effective dimension reduction (e.d.r.) subspace containing as much information about the output variable y as possible. It helps us visualize the structure of the data that encodes the output variable information. We apply KSIR to multi-class classification problems with more than three classes. The distinguishable classes will lie on the coordinate axes. We remove the top three distinguishable classes from the dataset, then apply KSIR to the remaining data points, and repeat these two steps iteratively. Finally, we obtain a classification tree that decomposes the original classification problem into a series of sub-problems. In contrast to the one-vs-rest and one-vs-one schemes, the conventional ways of decomposing a multi-class classification problem into a series of binary classification problems, our proposed method utilizes the information within the classes that is extracted via KSIR.

Speaker: 王鈺強 (Y.-C. Frank Wang)
Research Center for IT Innovation, Academia Sinica

Title: Domain Adaptation: Bridging Cross-Domain Data and Beyond
Abstract:
Domain adaptation aims to associate data and address learning tasks observed in different domains. For example, one could observe facial images of the subjects of interest in advance, while their sketch images are the ones to be recognized. As another example, one may need to recognize object images taken by mobile phones, while the training images are collected over the Internet (with different resolutions, viewpoints, etc.). Since training and test data are collected in distinct domains, the corresponding data distributions can be very different. As a result, traditional machine learning algorithms cannot be directly applied. In this talk, I will cover several recently developed learning algorithms for handling cross-domain data in a variety of settings, and talk about a number of applications in visual content recognition and synthesis.

Speaker: 彭捷 (Jie Peng)
Department of Statistics, University of California, Davis, USA

Title: Diffusion MRI: direction estimation and fiber tracking
Abstract:
Diffusion MRI is a magnetic resonance imaging technology which uses water diffusion as a proxy to probe the anatomy of biological tissues in an in-vivo and non-invasive way. Diffusion MRI has been widely used to reconstruct white matter fiber tracts and to provide information on the structural connectivity of the brain. We will start with diffusion tensor models, where water diffusion is modeled by Gaussian processes and characterized by tensors (covariance matrices). However, tensor models have difficulty resolving multiple crossing fibers within a voxel. We circumvent this problem by conducting direct diffusion direction estimation and smoothing. We then discuss a fiber tracking algorithm which uses the estimated directions as input and is applicable to crossing fiber regions.

Speaker: 李宏毅 (Hung-yi Lee)
Department of Electrical Engineering, National Taiwan University

Title: Deep Learning and its application in speech recognition
Abstract:
By establishing new state-of-the-art performance in speech recognition, image recognition, and some natural language processing tasks, deep learning techniques have achieved tremendous success in recent years. The focus of this talk is on deep learning approaches to speech recognition. I will first give a quick overview of the latest deep learning technology, which makes today's deep learning techniques very different from their 1980s ancestors. Then I will show how to apply deep learning to speech recognition and explain why it performs so well. Finally, the possible next waves of deep learning technology in speech recognition will be discussed, for example, end-to-end speech recognition.
