Algorithms 56-59. Cluster analysis by the methods of Hubert
For an n x m real matrix A, the matrix A⊥ is defined as a matrix spanning the orthocomplement of the column space of A, where orthogonality is defined with respect to the standard inner product ⟨x, y⟩ = x'y. In this paper we collect various properties of the ⊥ operation and its applications in linear statistical models. Results covering more general inner products are also considered. We also provide a rather extensive list of references.
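As a concrete numerical illustration (our own sketch, not from the paper), a matrix spanning the orthocomplement of col(A) can be obtained from the trailing left singular vectors of a full SVD; the function name and example matrix below are illustrative.

```python
import numpy as np

def orthocomplement(A, tol=1e-10):
    """Return a matrix whose columns span the orthocomplement of col(A)
    under the standard inner product, using the full SVD: the left
    singular vectors beyond rank(A)."""
    U, s, _ = np.linalg.svd(A, full_matrices=True)
    rank = int(np.sum(s > tol))
    return U[:, rank:]

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Aperp = orthocomplement(A)
# every column of Aperp is orthogonal to every column of A
```

For a 3 x 2 matrix of rank 2, the orthocomplement here is one-dimensional, and A'A⊥ = 0 by construction.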
The problem of decomposing a given covariance matrix as the sum of a positive semi-definite matrix of given rank and a positive semi-definite diagonal matrix is considered. We present a projection-type algorithm to address this problem. The algorithm performs extremely well and remains fast even when the given covariance matrix has a very large dimension. Its effectiveness is assessed through simulation studies and by applications to three real benchmark datasets...
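A generic alternating-projection sketch of this kind of decomposition (not the authors' algorithm): alternately project onto PSD matrices of rank at most r via a truncated eigendecomposition, then refit the nonnegative diagonal residual. All names and the toy covariance are assumptions for illustration.

```python
import numpy as np

def decompose(S, r, iters=200):
    """Alternating-projection sketch: S ≈ L + diag(D) with L PSD of
    rank <= r and D a nonnegative diagonal."""
    D = np.zeros(S.shape[0])
    for _ in range(iters):
        w, V = np.linalg.eigh(S - np.diag(D))
        w = np.clip(w, 0.0, None)               # PSD projection
        top = np.argsort(w)[::-1][:r]           # keep the r largest eigenvalues
        L = (V[:, top] * w[top]) @ V[:, top].T  # rank-<=r PSD part
        D = np.clip(np.diag(S - L), 0.0, None)  # nonnegative diagonal part
    return L, D

# toy covariance with exact structure: rank-1 part plus identity diagonal
B = np.array([[1.0], [2.0], [1.0], [0.5]])
S = B @ B.T + np.eye(4)
L, D = decompose(S, r=1)
```

On this exactly-structured example the iteration recovers the decomposition to small residual; real covariance matrices would only be approximated.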
Most earlier work on clustering has focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, working only on...
We revisit Sklar’s Theorem and give another proof, based primarily on right quantile functions. To this end we slightly generalise the distributional transform approach of Rüschendorf and establish some new results, including a rigorous characterisation of an almost surely existing “left-invertibility” of distribution functions.
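A small numerical sketch (with made-up example values) of the distributional transform underlying this approach: for a discrete X, U = F(X−) + V·(F(X) − F(X−)) with V ~ Uniform(0,1) is exactly Uniform(0,1), which is what makes the copula construction go through for non-continuous marginals.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy discrete distribution (assumed values for illustration)
xs = np.array([0, 1, 2])
ps = np.array([0.2, 0.5, 0.3])
F = np.cumsum(ps)    # F(x) at the atoms
Fminus = F - ps      # left limits F(x-)

# distributional transform: U = F(X-) + V * (F(X) - F(X-))
n = 200_000
X = rng.choice(len(xs), size=n, p=ps)
V = rng.uniform(size=n)
U = Fminus[X] + V * ps[X]
# empirically, U behaves like a Uniform(0,1) sample
```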
It is known that the identifiability of multivariate mixtures reduces to a question in algebraic geometry. We solve the question by studying certain generators in the ring of polynomials in vector variables, invariant under the action of the symmetric group.
The extraction of blood vessels from retinal images is an important and challenging task in medical analysis and diagnosis. This paper presents a novel hybrid automatic approach for the extraction of retinal image vessels. The method consists of applying mathematical morphology and a fuzzy clustering algorithm, followed by a purification procedure. In the mathematical morphology step, the retinal image is smoothed and strengthened so that the blood vessels are enhanced and the background information...
This paper compares five small area estimators, using Monte Carlo simulation in the context of both artificial and real populations. In addition to the direct and indirect estimators, we consider the optimal composite estimator with population weights and two composite estimators with estimated weights: one that assumes homogeneity of the within-area variance and squared bias, and one that uses area-specific estimates of variance and squared bias. In the study with the real population, we found that among...
If n independent observations, categorized according to three schemes with two categories in each scheme, have been taken, it is customary to summarize the data in a 2 x 2 x 2 contingency table (...)
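As a minimal illustration (with made-up observations), such a summary is simply a 2 x 2 x 2 array of counts indexed by the three binary schemes:

```python
import numpy as np

# assumed toy data: each observation categorized by three binary schemes
obs = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1), (0, 0, 0)]

table = np.zeros((2, 2, 2), dtype=int)
for a, b, c in obs:
    table[a, b, c] += 1  # cell (a, b, c) counts matching observations
```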
In this paper we consider an exploratory canonical analysis approach for multinomial populations based on the φ-divergence measure. We define the restricted minimum φ-divergence estimator, which is seen to be a generalization of the restricted maximum likelihood estimator. This estimator is then used in φ-divergence goodness-of-fit statistics, which are the basis of two new families of statistics for solving the problem of selecting the number of significant correlations as well as the appropriateness...
Let two multivariate stationary sequences be stationarily cross-correlated. Assume that all values of the first sequence and all but one value of the second are known. We determine the best linear interpolation of the unknown value on the basis of the known values and derive a formula for the interpolation error matrix. Our assertions generalize a result of Budinský [1].
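A generic sketch of best linear interpolation under a known covariance (a scalar special case, not Budinský's or the authors' formula): the predictor weights and the interpolation error variance follow from the usual conditioning identities C_mo C_oo⁻¹ and C_mm − C_mo C_oo⁻¹ C_om. The AR(1)-type covariance below is an assumed example.

```python
import numpy as np

# assumed toy setup: stationary sequence with covariance rho**|i-j|,
# the value at index `miss` unknown, all others observed
n, miss, rho = 7, 3, 0.6
idx = np.arange(n)
C = rho ** np.abs(np.subtract.outer(idx, idx))

obs = [i for i in range(n) if i != miss]
C_oo = C[np.ix_(obs, obs)]      # covariance of the known values
C_mo = C[np.ix_([miss], obs)]   # cross-covariance with the unknown value

# best linear predictor weights and the interpolation error variance
w = C_mo @ np.linalg.inv(C_oo)
err_var = float(C[miss, miss] - (w @ C_mo.T).item())
```

For this Markov covariance the interior interpolation error variance is (1 − ρ²)/(1 + ρ²), strictly smaller than the marginal variance.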
We consider a finite mixture of Gaussian regression models for high-dimensional heterogeneous data where the number of covariates may be much larger than the sample size. We propose to estimate the unknown conditional mixture density by an ℓ1-penalized maximum likelihood estimator. We shall provide an ℓ1-oracle inequality satisfied by this Lasso estimator with the Kullback–Leibler loss. In particular, we give a condition on the regularization parameter of the Lasso to obtain such an oracle inequality....