Anna Bartkowiak received her M.S. and Ph.D. in applied mathematics from the University
of Wroclaw, and her Dr.Sci. (habilitation) from the Institute of Computer Science, PAS, Warsaw.
She has been affiliated with Wroclaw University since 1970, from 1995 as a professor
in the Institute of Computer Science, Wroclaw University. She is now retired.
Her professional interests are applications of multivariate statistical methods
to real-life problems and, more generally, Computational Statistics, Pattern Recognition,
Data Mining, and more recently Machine Learning and Artificial Intelligence,
applied specifically to the following topics: Regression and Discriminant Analysis,
Graphical Visualisation of multivariate data, and Artificial Neural Networks. She has
published several books and tutorials (in Polish) and over 300 papers (in English)
on these topics.
She is a Fellow of the Royal Statistical Society (London) and a member of the American
Statistical Association (ASA), the International Society for Clinical Biostatistics (ISCB),
the Biometric Society (BS), the Polish Biometric Society (PTB), and the Polish Information
Processing Society (PTI).
Assessing data variables by some collective intelligence methods
Inst. of Computer Science, University of Wroclaw (retired professor)
Statistically, since Pearson, data have been recorded as matrices of size n×p,
where the n rows correspond to subjects (individuals, cases) and the p columns contain
the values of the variables (attributes) characterizing the subjects.
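As a minimal sketch of this convention, the following Python snippet (using NumPy; the variable values are invented for illustration only) builds such an n×p matrix with n = 4 subjects and p = 3 variables:

```python
import numpy as np

# An n x p data matrix: rows are subjects, columns are variables.
# Here n = 4 subjects and p = 3 (illustrative) numeric variables.
X = np.array([
    [1.70, 65.0, 34.0],   # subject 1
    [1.82, 80.5, 41.0],   # subject 2
    [1.65, 58.2, 29.0],   # subject 3
    [1.75, 72.3, 37.0],   # subject 4
])
n, p = X.shape
print(f"n = {n} subjects, p = {p} variables")  # n = 4 subjects, p = 3 variables
```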
When performing a traditional multivariate analysis of the recorded data,
the crucial question is whether all p recorded variables should be taken into the
analysis: perhaps fewer of them will be sufficient, and some of them are
not relevant, or even an impediment.
The old saying “the more the better” has become questionable nowadays:
too many irrelevant variables may be disturbing, introducing random
effects into the data.
The problem to solve is a composite one.
I will consider it in the context of regression or classification analysis,
dealing with directly recorded ‘variables’ (not ‘features’ derived from them).
I will concentrate on a group of methods referred to as Collective Intelligence
(comprising, among others, Ensemble Learning, Decision Trees, and Random Forests).
Specifically, I will concentrate on the Random Forests (RFs) methodology.
RFs offer some non-conventional indices of the importance of variables in the context
of regression and clustering.
They work directly on the original variables (not on new features derived from them).
They can work on variables of mixed type, that is, quantitative (numeric)
or qualitative (categorical).
They work without assumptions about the probability distribution of the variables.
They yield an internal, unbiased estimate of the generalization error.
It has been shown that RFs are resistant to outliers, although not all of them are.
I intend to show – on real data examples – how all this works in practice.
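As a minimal sketch of such an analysis (in Python, assuming the scikit-learn implementation of Random Forests; the data here are synthetic, not the real data examples of the talk), one can fit a forest on data containing both relevant and pure-noise variables, read off the variable importances, and obtain the internal out-of-bag (OOB) estimate of the generalization error:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
# Two informative variables and three pure-noise variables (p = 5).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise = rng.normal(size=(n, 3))
y = (x1 + x2 > 0).astype(int)          # class label depends only on x1, x2
X = np.column_stack([x1, x2, noise])   # the n x p data matrix

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

# The two informative variables should receive markedly higher importances
# than the noise variables.
print("variable importances:", rf.feature_importances_)
# Internal (out-of-bag) estimate of the generalization error.
print("OOB error estimate:", 1 - rf.oob_score_)
```

The importance scores are the forest's mean decrease in impurity; scikit-learn also offers permutation importance, which is closer to the index originally proposed by Breiman.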