Study of decision tree algorithms: effects of air pollution on under five mortality in Ulaanbaatar


Method and materials

Classification algorithms

Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical class labels. Data classification is a two-step process, consisting of a learning step and a classification step.

In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step, in which a classification algorithm builds the classifier by analysing a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, …, An. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute.

In the second step, the model is used for classification. First, the predictive accuracy of the classifier is estimated. If we were to use the training set to measure the classifier’s accuracy, this estimate would likely be optimistic, because the classifier tends to overfit the data. Therefore, a test set is used, made up of test tuples and their associated class labels. They are independent of the training tuples, meaning that they were not used to construct the classifier. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. The associated class label of each test tuple is compared with the learned classifier’s class prediction for that tuple.7
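The two-step process above can be sketched in plain Python. The one-rule "classifier" below (a single threshold on one attribute, always predicting the low state below the threshold) is a deliberately simple, hypothetical stand-in for a real learner, and the attribute values and labels are invented for illustration; the point is only that the model is learned on the training tuples and its accuracy is then estimated on an independent test set:

```python
def learn_one_rule(train):
    """Learning step: pick the threshold on the single attribute that
    classifies the most training tuples correctly (rule: x <= t -> 'low')."""
    best = None
    for threshold, _ in train:
        correct = sum((x <= threshold) == (label == "low") for x, label in train)
        if best is None or correct > best[1]:
            best = (threshold, correct)
    return best[0]

def accuracy(threshold, data):
    """Classification step: percentage of tuples whose predicted label
    matches the associated class label."""
    preds = ["low" if x <= threshold else "high" for x, _ in data]
    return sum(p == label for p, (_, label) in zip(preds, data)) / len(data)

# Invented training and (independent) test tuples: (attribute value, class label)
train = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high")]
test = [(2, "low"), (9, "high"), (4, "low"), (6, "high")]

t = learn_one_rule(train)
print(accuracy(t, test))  # → 0.75
```

Note that the model classifies the training set perfectly but misclassifies one test tuple, illustrating why accuracy estimated on the training set tends to be optimistic.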

In this study, we focused on decision tree induction.

Decision tree

Decision tree induction is the learning of decision trees from class labelled training tuples. A decision tree is a flow chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.

During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3. Quinlan later presented C4.5, a successor of ID3, which became a benchmark to which newer supervised learning algorithms are often compared, and subsequently C5.0.7

C5.0

The C5.0 algorithm is an extension of the C4.5 algorithm; the model works by splitting the sample on the field that yields the maximum information gain.8 Each subsample defined by the first split is then split again, based on a different field, and the process repeats until the subsamples cannot be split any further. Finally, the lowest-level splits are re-examined, and splits that do not contribute significantly to the value of the model are pruned.8 9

C5.0 uses the concept of entropy to measure purity; entropy quantifies the homogeneity of the class attribute in the dataset. The minimum value of 0 indicates a completely homogeneous sample, while 1 indicates maximum impurity for a two-class sample.10 11 Entropy is defined as

$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2(p_i) \qquad (1)$$

In equation 1, for a given segment S of the dataset, c refers to the number of different class levels, and $p_i$ refers to the proportion of values in class level i. In this study, the dataset has two classes: 55% of observations are in the low state and 45% in the high state of mortality. When the entropy was calculated, it was 0.99:

$$\mathrm{Entropy} = -0.55 \times \log_2(0.55) - 0.45 \times \log_2(0.45) \approx 0.99$$
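Equation 1 can be checked numerically; the short sketch below reproduces the study's entropy value from the stated 55%/45% class proportions:

```python
from math import log2

def entropy(proportions):
    """Shannon entropy (base 2) of a list of class proportions, per
    equation 1; zero-probability terms are skipped."""
    return sum(-p * log2(p) for p in proportions if p > 0)

# The study's two-class outcome: 55% low and 45% high under-five mortality.
print(round(entropy([0.55, 0.45]), 2))  # → 0.99
```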

After calculating the entropy of the class attribute, the algorithm must decide which feature to split on. To do so, it calculates the change in homogeneity that would result from a split on each possible feature; this quantity is called the information gain. The information gain of a feature F is calculated as the difference between the entropy of the segment $S_1$ before the split and the entropy of the segments $S_2$ resulting from the split.10 11 That is,

$$\mathrm{InfoGain}(F) = \mathrm{Entropy}(S_1) - \mathrm{Entropy}(S_2)$$

After the split, the dataset is divided into more than one partition, so the function $\mathrm{Entropy}(S_2)$ must consider the total entropy across all the partitions. The entropy contributed by each partition is weighted by the proportion of records in that partition, which can be expressed by the following formula:

$$\mathrm{Entropy}(S_2) = \sum_{i=1}^{n} w_i \, \mathrm{Entropy}(P_i)$$
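The information-gain calculation is compact in code; the sketch below computes entropy directly from raw class labels and evaluates the gain for a hypothetical split, showing that a perfectly separating split recovers the full parent entropy:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(parent, partitions):
    """Entropy(S1) minus the weighted entropy of the partitions, Entropy(S2)."""
    n = len(parent)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - weighted

parent = ["low", "low", "high", "high"]
# A perfectly separating split yields the full gain of 1 bit ...
print(info_gain(parent, [["low", "low"], ["high", "high"]]))  # → 1.0
# ... while an uninformative split yields no gain at all.
print(info_gain(parent, [["low", "high"], ["low", "high"]]))  # → 0.0
```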

The higher a feature's information gain, the better that feature is at creating homogeneous groups after a split on it.
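Putting the pieces together, a minimal ID3-style recursive splitter (splitting on the feature with maximum information gain until nodes are pure) might look like the sketch below. This is only an illustration of the splitting principle described above: the full C5.0 algorithm additionally handles continuous attributes and applies pruning, which are omitted here, and the season/PM feature values are invented:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def best_feature(rows, labels):
    """Index of the feature whose split gives the maximum information gain."""
    def gain(i):
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[i], []).append(y)
        weighted = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
        return entropy(labels) - weighted
    return max(range(len(rows[0])), key=gain)

def build_tree(rows, labels):
    """Recursively split on the best feature until nodes are pure."""
    if len(set(labels)) == 1:
        return labels[0]                      # pure leaf: a class label
    if len(set(rows)) == 1:
        return Counter(labels).most_common(1)[0][0]  # cannot split further
    i = best_feature(rows, labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[i], ([], []))
        groups[row[i]][0].append(row)
        groups[row[i]][1].append(y)
    return {"feature": i,
            "branches": {v: build_tree(r, l) for v, (r, l) in groups.items()}}

# Invented tuples: (season, PM level) -> mortality state
rows = [("winter", "high_pm"), ("winter", "low_pm"),
        ("summer", "low_pm"), ("summer", "low_pm")]
labels = ["high", "low", "low", "low"]
print(build_tree(rows, labels))
```

On this toy data the splitter chooses the PM feature (index 1), since splitting on it produces two pure groups and hence the larger information gain.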

CART

The Classification and Regression Tree (CART) algorithm uses the Gini index, which measures the impurity of D, a data partition or set of training tuples, as

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$

where $p_i$ is the probability that a tuple in D belongs to class $C_i$ and is estimated by $|C_{i,D}|/|D|$. The sum is computed over m classes.7 12
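The Gini index is straightforward to compute; applied to the study's 55%/45% class proportions, the sketch below gives its value for this dataset:

```python
def gini(proportions):
    """Gini index of class proportions: 1 minus the sum of squared p_i."""
    return 1 - sum(p * p for p in proportions)

# The study's two classes: 55% low, 45% high mortality.
print(round(gini([0.55, 0.45]), 3))  # → 0.495
```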

The Gini index considers a binary split for each attribute. Suppose A is a discrete-valued attribute having v distinct values, $\{a_1, a_2, \ldots, a_v\}$, occurring in D. To determine the best binary split on A, all possible subsets that can be formed from the known values of A are examined first.7

When a binary split is made, a weighted sum of the impurity of each resulting partition is computed. If a binary split on A partitions D into $D_1$ and $D_2$, the Gini index of D given that partitioning is

$$\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset.7

For continuous-valued attributes, each possible split-point must be considered. The strategy is similar to that used for information gain: the midpoint between each pair of adjacent values is taken as a possible split-point, and the point with the minimum Gini index for the continuous-valued attribute is taken as the split-point of that attribute. For a possible split-point of A, $D_1$ is the set of tuples in D satisfying $A \le \text{split-point}$, and $D_2$ is the set of tuples in D satisfying $A > \text{split-point}$. The reduction in impurity that would be incurred by a binary split on a discrete-valued or continuous-valued attribute A is

$$\Delta\mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D)$$

The attribute that maximises the reduction in impurity or has the minimum Gini index is selected as the splitting attribute.7
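The midpoint search for a continuous-valued attribute can be sketched as follows; the PM2.5 readings and mortality states below are invented purely for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini index computed from a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and
    return (split-point, weighted Gini) with the minimum weighted Gini."""
    pairs = sorted(zip(values, labels))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        mid = (a + b) / 2
        left = [y for x, y in pairs if x <= mid]    # D1: A <= split-point
        right = [y for x, y in pairs if x > mid]    # D2: A > split-point
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = (mid, weighted)
    return best

# Invented PM2.5 readings against mortality state:
pm25 = [10, 20, 30, 80, 90]
state = ["low", "low", "low", "high", "high"]
print(best_split_point(pm25, state))  # → (55.0, 0.0)
```

Here the midpoint 55.0 separates the two classes perfectly, so its weighted Gini index is 0 and the impurity reduction $\Delta\mathrm{Gini}$ is maximal for this attribute.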
