User Guide: Introduction

  1. Problems of Data Analysis in CR Physics

  2. Statistical inference in CR physics

  3. ANI Strategy

  4. ANI Manual 09-2019

  5. B.IN Description

1. Problems of Data Analysis in CR Physics


The observational evidence for the existence of very high energy cosmic ray point sources has revived substantial interest in cosmic ray physics and initiated the construction of very large experimental complexes. Arrays of particle detectors covering a large area measure different parameters of the numerous secondary products of the primary cosmic ray interactions with the atmosphere. Only a simultaneous measurement of a large number of independent parameters in each individual event can yield reliable information for reconstructing the origin and energy of the primary particle, as well as the peculiarities of the strong interaction at the top of the atmosphere.

The ambiguity in the interpretation of the results of cosmic ray experiments is connected both with significant gaps in our knowledge of the characteristics of hadron-nuclear interactions at superaccelerator energies and with the uncertainty of the primary cosmic ray composition, as well as with the strong fluctuations of all shower parameters. Extra difficulties arise because the experiments are indirect and hence rely on Monte Carlo simulations of the development and detection of the different components of nuclear-electromagnetic cascades.

To make the conclusions about the investigated physical phenomenon more reliable and significant, it is necessary to develop a unified theory of statistical inference based on nonparametric models, in which various nonparametric approaches (density estimation, Bayesian decision making, error rate estimation, feature extraction, sample control during handling, neural net models, etc.) are used.

The most important part of the present approach is the quantitative comparison of multivariate distributions and the use of nonparametric techniques to estimate the probability density in the multidimensional feature space. In contrast to previously used methods of inverse problem solution, the object of analysis in the proposed project is each particular event (a point in the multivariate space of measured parameters) rather than alternative distributions of model and experimental data. Therefore, along with the averaged characteristics, the membership of each event in a certain class is determined (the statistical decision problem).

This approach was first used to estimate the upper limit of the iron nuclei fraction from the gamma-family characteristics registered by the PAMIR collaboration [1]. Our results confirm the "normal composition" hypothesis of primary cosmic rays - 40% protons and 20% iron nuclei - and reject the hypothesis of dominance of iron nuclei in the PCR at E ≥ 10 PeV.

The handling of EAS simulation data shows that the methodology allows us:

  • to find the model closest to the experiment from the bank of possible models;
  • to determine the type of the primary with 70-80% efficiency and to estimate its energy with an accuracy of 25% [2];
  • to obtain model-independent estimates of some parameters of the strong interaction at superaccelerator energies from the characteristics of the electron-photon and muon components of EAS.

In particular, the possibility of estimating the inelastic cross section of the interaction of protons with air nuclei and the inelasticity coefficients in P-A interactions was investigated. These results give us hope that conducting a targeted experiment in cosmic rays and studying the characteristics of P-A and A-A interactions in the energy range 1-100 PeV is possible [3].

A multidimensional analysis was applied to the classification of the Cherenkov light images of air showers registered by the air Cherenkov telescopes with multichannel light receivers of the Whipple observatory and the HEGRA collaboration. The differences in the angular size of the image, its orientation, and its position in the focal plane of the receiver were used in the analysis to distinguish the showers induced by primary gamma rays from showers induced by background cosmic hadrons. It was shown that the use of several image parameters together with their correlations can suppress the background to a few tenths of a percent while retaining about 50% of the useful (gamma-ray induced) events.

The application of the multidimensional technique to the well-known Crab detection data file (Whipple observatory, 1988-1989) proves the advantage of the new background suppression technique and the achievement of a considerable enhancement of the source detection significance [4].

2. Statistical inference in CR physics

The most difficult and most important part of high energy physics data analysis is the comparison of competitive hypotheses and decision making on the nature of the investigated physical phenomenon.

In cosmic ray physics the main technique of statistical inference connected with the problem of determining the initial physical parameters (such as the mass composition and energy spectrum of the PCR, strong interaction characteristics, the flux of very high energy gamma quanta from point sources, etc.) is the direct problem solution: a detailed simulation of the cosmic ray traversal of the atmosphere and the experimental installation, followed by a comparison of the multivariate simulated and experimental data. In practice, an algorithm is constructed which describes EAS development and the registration of its different components at the observation level, and which is based on a certain model of the investigated process, i.e. the set of parameters that characterize the PCR flux and the interaction of hadrons, nuclei and gamma quanta with air nuclei.

By simulating with different models and comparing the experimental and model data, a class of models is selected which describes the experimental data satisfactorily. Such an approach allows us to discard a certain class of unsatisfactory models, but the available experimental data do not allow one to select a unique model among the many proposed, since the mass composition and energy spectrum of the PCR and the characteristics of hadron-nucleus interactions at E ≥ 1 PeV are unknown.

For almost all problems of inference, the crucial question is whether the fitted probability family is in fact consistent with the data. Usually parametric models are chosen for their statistical tractability rather than for their appropriateness to the real process being studied.

Of course, any statistical inference is conditioned on the model used, and if the model is oversimplified, so that essential details are omitted or improperly defined, at best only qualitative conclusions can be drawn. Nowadays, in cosmic ray and accelerator physics, very sophisticated models are used which completely mimic the stochastic mechanism whereby the data are generated. Such models are defined on a more fundamental level than parametric models and provide us with a wide range of outcomes from identical sets of input variables, the so-called "labeled" or training samples (TS). These sets of events with known membership represent the general, nonparametric mode of a priori information.

Though simulation is widely used in high energy physics data analysis, we are aware of very few systematic investigations of the theoretical aspects of how data may be compared with their simulated counterparts. What we need is a well defined technique, which one may call Monte Carlo inference.

The presented approach considers the classification and hypothesis testing problems in the framework of the Bayesian paradigm, and the main steps of the unified data analysis methodology are as follows.

3. ANI Strategy

3.1 Selection of the best subset of measured variables

First, the intrinsic dimensionality of the data is determined; two dimensionality estimation methods are used - the average local dimensionality and the correlation dimensionality.
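As a minimal illustration (a sketch in the Grassberger-Procaccia spirit, not the ANI implementation itself), the correlation dimensionality can be estimated from the slope of the correlation integral; the sample and the set of radii below are invented for the example.

```python
# Sketch: correlation-dimension estimate of the intrinsic dimensionality.
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, r_values):
    """Slope of log C(r) vs log r, where C(r) is the fraction of point pairs
    closer than r (the correlation integral)."""
    d = pdist(X)                               # all pairwise Euclidean distances
    C = np.array([np.mean(d < r) for r in r_values])
    mask = C > 0                               # avoid log(0) at very small radii
    slope, _ = np.polyfit(np.log(r_values[mask]), np.log(C[mask]), 1)
    return slope

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))                 # toy 3-dimensional sample
radii = np.geomspace(0.2, 2.0, 15)
print(correlation_dimension(X, radii))         # should come out close to 3
```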

Then the best data subset (in the sense of discriminative value) is selected. Proceeding from the initial dimensionality, at each step of dimensionality reduction the "worst" feature is discarded, according to the reduction of the Bhattacharyya distance calculated for each of the candidate subsets.
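The sketch below illustrates this backward elimination under a Gaussian approximation of the class densities; the function names and the stopping criterion (a fixed number of retained features) are illustrative assumptions, not the ANI routine.

```python
# Sketch: backward feature elimination guided by the Bhattacharyya distance.
import numpy as np

def bhattacharyya(X1, X2):
    """Bhattacharyya distance between two classes under a Gaussian approximation."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))
    S2 = np.atleast_2d(np.cov(X2, rowvar=False))
    S = 0.5 * (S1 + S2)
    dm = m1 - m2
    term_mean = 0.125 * dm @ np.linalg.solve(S, dm)
    term_cov = 0.5 * (np.linalg.slogdet(S)[1]
                      - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))
    return term_mean + term_cov

def select_features(X1, X2, n_keep):
    """Drop, one at a time, the feature whose removal costs the least distance."""
    features = list(range(X1.shape[1]))
    while len(features) > n_keep:
        worst = max(features, key=lambda f: bhattacharyya(
            X1[:, [g for g in features if g != f]],
            X2[:, [g for g in features if g != f]]))
        features.remove(worst)
    return features
```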

The quantitative comparison of variables is done by means of the P-values of statistical tests, which reflect the relative discriminative value of each variable. The null hypothesis H_0 states that the two compared distributions come from the same population. The smaller the P-value, the more confidently one can reject H_0 and accept the alternative hypothesis that the two samples come from different populations; the significance of the test thus serves as a measure of the "distance" between the populations. Three tests are used: the parametric Student t-test, the nonparametric Kolmogorov-Smirnov test, and the nonparametric Mann-Whitney test.
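For reference, the three tests can be reproduced with SciPy as in the sketch below; the samples and variable names are invented for illustration, and SciPy reports the conventional p-value, for which small values argue against H_0.

```python
# Sketch: the three two-sample tests applied to one measured variable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample_a = rng.normal(5.0, 1.0, 500)   # a measured variable for class A (toy data)
sample_b = rng.normal(5.4, 1.2, 500)   # the same variable for class B

t_stat, p_t   = stats.ttest_ind(sample_a, sample_b, equal_var=False)  # Student t-test
ks_stat, p_ks = stats.ks_2samp(sample_a, sample_b)                    # Kolmogorov-Smirnov
u_stat, p_u   = stats.mannwhitneyu(sample_a, sample_b)                # Mann-Whitney

print(p_t, p_ks, p_u)   # small p-values argue against H_0 (same population)
```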

The covariance analysis can help in selecting the best feature pairs. Large values of the Fisher matrix elements indicate that the apparent difference in the pairwise correlation between the classes is significant; the greater these values, the greater the discriminative power of the particular pair. The correlation matrices present the correlations of the variables and help one to perform an alternative procedure of subset selection: one can proceed from one or more best pairs and then add the best single variable, trying to choose one that is not correlated with the previously chosen ones.
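A crude illustration of the pairwise comparison (it does not reproduce the ANI Fisher matrix, whose exact definition is not given here): the absolute difference of the class correlation matrices flags feature pairs whose correlation structure differs most between the classes.

```python
# Sketch: compare pairwise correlations of the two classes.
import numpy as np

def correlation_difference(X1, X2):
    """Absolute difference of the class correlation matrices; large entries
    point to feature pairs whose correlation structure separates the classes."""
    R1 = np.corrcoef(X1, rowvar=False)
    R2 = np.corrcoef(X2, rowvar=False)
    return np.abs(R1 - R2)

# Usage: D = correlation_difference(class_a, class_b)
#        best_pair = np.unravel_index(D.argmax(), D.shape)
```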

The best subsets can also be selected by the so-called Bayes risk values - the mean losses (errors) committed by the optimal Bayesian classifier during a leave-one-out test over the training sample. The decision making is performed by means of multidimensional probability density estimation. The adaptation of the K nearest neighbors and Parzen window methods helps to obtain density estimates better than those obtained with any prechosen fixed parameter. Several estimates, corresponding to a set of method parameters, are calculated simultaneously. The parameters can be tuned by taking a broad subset at the beginning and subsequently narrowing it by examining the Bayes risk values.
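A hedged sketch of the leave-one-out Bayes risk estimate with a K-nearest-neighbour classifier is given below; here k plays the role of the method parameter to be tuned, and the function is an illustration rather than the ANI code.

```python
# Sketch: leave-one-out error rate of a k-nearest-neighbour classifier.
import numpy as np
from scipy.spatial.distance import cdist

def loo_bayes_risk(X, y, k):
    """Fraction of training events misclassified when each event is removed
    from the sample and classified by its k nearest neighbours."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)             # leave the event itself out
    errors = 0
    for i in range(len(X)):
        nn = np.argsort(D[i])[:k]           # indices of the k nearest neighbours
        vote = np.bincount(y[nn]).argmax()  # majority class among the neighbours
        errors += (vote != y[i])
    return errors / len(X)

# Usage: scan a broad set of k values and narrow it around the minimum risk,
# e.g. risks = {k: loo_bayes_risk(X, y, k) for k in (5, 10, 20, 40)}
```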

3.2 Bayesian classification

The Bayesian approach provides a general method of incorporating prior and experimental information.

The decision rule that assigns an observable v to the class with the highest a posteriori density (the Bayesian decision rule) takes into account all useful information and all possible losses due to any wrong decision.

The posterior density is the basis of statistical decisions on the particle type and on the closeness of the simulated and experimental data. The term closeness refers to the degree of coincidence, similarity, correlation, overlap, or any such quantity. The most convenient closeness measure, commonly used in pattern recognition problems for feature extraction, is the Bayes error (the Bayes risk for the simple loss function). The Bayes classifier provides the minimum probability of error among all classifiers for the same feature set.
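Schematically (a sketch with hypothetical function names), the decision rule reads as follows; how the class-conditional densities are estimated is discussed below, and here they are simply passed in as ordinary functions.

```python
# Sketch: Bayesian decision rule with externally supplied class densities.
def bayes_classify(v, class_densities, priors):
    """class_densities: {label: callable p(v | class)}; priors: {label: P(class)}.
    Assign v to the class with the largest prior * class-conditional density,
    i.e. the largest posterior up to a common normalisation factor."""
    posterior = {c: priors[c] * class_densities[c](v) for c in class_densities}
    return max(posterior, key=posterior.get)
```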

However, it is impossible to calculate the Bayes risk and other distance measures directly, as the analytic expression of the conditional densities and, hence, of the posterior ones is unknown. Therefore, we are obliged to use their nonparametric estimates.

Nonparametric in the sense that the density function is not a particular member of a previously chosen parametric distribution family, but an estimate based only on the sample information and on very mild conditions on the underlying density (usually only continuity).

The ANI package uses the most popular kernel-type estimates, introduced by Rosenblatt and studied by Parzen. These density estimates enable effective control of the degree of smoothing of the empirical distributions.
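A minimal sketch of such a kernel (Parzen-Rosenblatt) estimate with a Gaussian kernel is shown below; h is the smoothing parameter, and smaller h means less smoothing.

```python
# Sketch: d-dimensional Parzen density estimate with a Gaussian kernel.
import numpy as np

def parzen_density(v, sample, h):
    """f_hat(v) = (1/n) * sum_i K_h(v - x_i), with K_h a Gaussian of width h."""
    n, d = sample.shape
    diff = (sample - v) / h
    kernels = np.exp(-0.5 * np.sum(diff**2, axis=1)) / ((2 * np.pi)**(d / 2) * h**d)
    return kernels.mean()
```

An estimate of this kind could also serve as one of the class_densities functions in the decision-rule sketch above.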

In ANI the procedure of selecting the optimal method parameter value is implemented using the surprisingly powerful technique of order statistics.

These estimates use more detailed information about the neighborhood of the point v and are more stable.

For each point where the density has to be estimated, a unique choice of the method parameter value is made. The most stable member of the sequence appears in the middle of the ordered sequence. This self-adjusting character of the estimates, as we shall see further, leads to estimates better than those one can obtain with any fixed parameter value.
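One plausible reading of this order-statistics prescription (a sketch, not necessarily the exact ANI algorithm) is to compute the estimate for a whole set of parameter values and take, at each point, the middle member of the ordered sequence; parzen_density refers to the sketch above.

```python
# Sketch: self-adjusting estimate as the middle member of an ordered sequence.
import numpy as np

def adaptive_density(v, sample, h_values):
    estimates = np.sort([parzen_density(v, sample, h) for h in h_values])
    return estimates[len(estimates) // 2]   # the most "stable" middle member
```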

3.3 Estimation Problem

Nonparametric regression is used for parameter estimation purposes. The peculiarity of the solution of the regression problem in cosmic ray physics is the fact that neither the true spectrum f(E) nor the conditional density P(V|E) is known in the general case; however, there is a training sequence {E_i, V_i} (obtained by simulation) and it is required to "recover" the regression E = E(V) from this sequence.

Due to the complicated stochastic picture of particles and nuclei passing through the atmosphere and the detectors, we cannot expect a standard probabilistic interpretation of all random processes. That is why we have chosen a method based on a nonparametric treatment of the a priori information, which does not impose any structure and fully uses the information carried by the TS.

The method is based on the obvious fact that events close in some metric (usually the Mahalanobis metric) in the feature space have similar energy - the compactness hypothesis.
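A minimal sketch of such a regression: average the energies of the k training events nearest to the observed event in the Mahalanobis metric defined by the training-sample covariance. The value of k and the plain averaging rule are illustrative choices, not the ANI prescription.

```python
# Sketch: k-nearest-neighbour regression in the Mahalanobis metric.
import numpy as np

def knn_energy_estimate(v, V_train, E_train, k=10):
    """Average the energies of the k training events nearest to v."""
    S_inv = np.linalg.inv(np.cov(V_train, rowvar=False))
    diff = V_train - v
    d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # squared Mahalanobis distances
    nearest = np.argsort(d2)[:k]
    return E_train[nearest].mean()
```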

3.4 Neural Net Solutions

An alternative and very powerful classification and estimation technique is connected with the development of mathematical models of neural nets (NN). The input layer of the feed-forward network has one node for each feature; signal processing is performed layer by layer beginning from the input. Neurons of successive layers receive input only from neurons of the previous layer, and each neuron in a given layer sends its output to all nodes in the next layer. The output layer has a single node (which is enough for data classification), determining the value of the discrimination function.

Such a data handling design, combining linear summation at the node inputs and a nonlinear transformation in the nodes, allows us to take into account all distinctive information, including differences in the nonlinear correlations of the alternative classes of multidimensional features.
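The architecture described above can be sketched as follows; the layer sizes, names, and the sigmoid nonlinearity are illustrative assumptions rather than the ANI defaults.

```python
# Sketch: one hidden layer, a single sigmoid output as the discrimination function.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FeedForwardNet:
    def __init__(self, n_features, n_hidden, rng=None):
        rng = rng or np.random.default_rng(0)
        # small random Gaussian initial weights (zero mean, small variance)
        self.W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0

    def forward(self, v):
        h = sigmoid(v @ self.W1 + self.b1)     # hidden-layer activations
        return sigmoid(h @ self.W2 + self.b2)  # single output in (0, 1)
```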

The training is performed with simulated data and/or calibration results. The initial values of the net parameters are chosen randomly from a Gaussian population with zero mean and small variance. In such a randomly initialized net the output for an arbitrary input set, regardless of which class it belongs to, is near 0.5, and classification is impossible. The training of the NN consists in multiple processing of all training samples with iterative modifications of the connection coefficients.

The minimization of the quality function is usually done by the back propagation method: gradient descent is performed on the quality function with respect to the weights in order to minimize the deviations of the network response from the desired "true" response. The main drawback of such methods is their convergence to a local minimum; in contrast, a random search in the space of net parameters allows one to escape from the region of a local minimum and to continue the search until the global minimum is reached.
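A hedged sketch of such a random search over a flat parameter vector: random perturbations are accepted only when they decrease the quality function, with an occasional larger jump (an illustrative device, not necessarily the ANI prescription) to help leave the region of a local minimum. The loss argument can be any callable mapping a parameter vector to the quality function value.

```python
# Sketch: random search minimisation of a quality function.
import numpy as np

def random_search(loss, w0, n_iter=5000, step=0.05, jump_every=500, seed=2):
    """Keep randomly perturbed parameter vectors only when the loss improves."""
    rng = np.random.default_rng(seed)
    w, best = w0.copy(), loss(w0)
    for i in range(1, n_iter + 1):
        scale = step * (10 if i % jump_every == 0 else 1)   # occasional long jump
        trial = w + rng.normal(0.0, scale, size=w.shape)
        val = loss(trial)
        if val < best:
            w, best = trial, val
    return w, best
```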

A common complaint about these techniques is the dependence of the final classification scheme on the purity and finiteness of the training sets (small training sample effects). However, due to the inherent robustness of NN classifiers, the results of NN analyses are relatively insensitive to modest impurities in the training sets.

Exploiting the tolerance of the NN methodology and assuming that the measured event description variables contain sufficient information to separate the processes, the NN methodology can be expected to provide the possibility of uncovering new physics in cosmic ray experiments.

3.5 Limitations and Perspectives of Development

The potential difficulties and limitations of the ANI package are connected with the model dependence of statistical inference. The question of the correctness of the model itself is always open, and we need a more general procedure to check the model validity and obtain physical results that do not depend so crucially on the prechosen models.

One possibility of model-independent inference is connected with cluster analysis - scanning the multidimensional feature space to find singularities of the probabilistic measure - but difficulties will be encountered in interpreting the embedded structures and in estimating the desired physical quantities.

In the framework of the present program a new approach to obtaining model-independent inference has to be developed.

Proceeding from the bank of possible models - a mixture of simulation trials corresponding to different particles, energies and strong interaction parameters, each with its own fuzzy coefficient - one can organize a subsequent optimization procedure, simultaneously tuning both the astrophysical (chemical composition, energy spectra) and the strong interaction parameters.
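A hedged sketch of this idea: the fuzzy coefficients (mixture weights) of the simulation trials in the model bank are tuned so that the weighted mixture of simulated distributions best matches the experimental one. The chi-square figure of merit and the optimizer below are illustrative choices, not the procedure proposed for ANI.

```python
# Sketch: tune mixture weights of model histograms against an experimental histogram.
import numpy as np
from scipy.optimize import minimize

def fit_mixture_weights(sim_histograms, exp_histogram):
    """sim_histograms: (n_models, n_bins); exp_histogram: (n_bins,)."""
    n_models = sim_histograms.shape[0]

    def chi2(w):
        w = np.abs(w) / np.abs(w).sum()        # keep the weights positive and normalised
        mix = w @ sim_histograms
        return np.sum((mix - exp_histogram)**2 / np.maximum(exp_histogram, 1.0))

    res = minimize(chi2, np.full(n_models, 1.0 / n_models), method='Nelder-Mead')
    return np.abs(res.x) / np.abs(res.x).sum()
```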

Up to now, in spite of the extensive development of NN theory and applications, many important theoretical problems are far from being solved. Only very few quantitative results are available.

There are several practical problems to be solved:

  • Selection of the learning rules for different applied problems;
  • Modeling and introduction of "smart" neurons in NN;
  • Investigation of the influence of the accuracy of the coupling (weight) determination on the NN performance;
  • Investigation of the influence of the shape of the nonlinear output function and of the number of nodes in the hidden layer on the sensitivity of the NN classifier to the finiteness of the TS;
  • Designing fast training algorithms which minimize the true error (on test sample) instead of minimizing the apparent error (on training sample);
  • Multiple comparisons of random search and backpropagation algorithms for a variety of initial parameters.

Along with the outlined NN research program, it is proposed to work out a complete network development platform, including:

  • kits for the design of the optimal NN type and topology;
  • the software package implementing the multivariate data preselection;
  • the NN software simulators with a number of alternative training methodologies.