Classifying High-Speed Data Streams Using Statistical Decision Trees

Mirela Teixeira Cazzolato, Marcela Xavier Ribeiro


Every day, large amounts of data are collected by applications such as credit card transaction systems, monitoring networks and sensors. Data of this type, called data streams, are generated automatically, and the techniques used to store them and extract knowledge from them differ from those used on traditional data. The classification task builds a model to describe and distinguish classes of data. In the context of data stream classification, many incremental techniques have been proposed. Existing methods tend to improve their classification accuracy as the number of processed examples increases. However, this characteristic makes them conservative when the dataset is small, since they depend on the amount of data available. In this work we propose two algorithms that do not depend on the number of examples read and that achieve high accuracy with low execution time. We describe an incremental decision tree algorithm called StARMiner Tree (ST), which is based on the Very Fast Decision Tree (VFDT) system, handles numerical data, and uses a statistics-based heuristic both to decide when to split a node and to choose the best attribute for the test node. We also present a non-parametric version of ST called AST. We applied ST and AST to four datasets, one synthetic and three real-world, and compared their performance with that of VFDT and VFDTcNB, an extension of VFDT that uses Naïve Bayes in the leaves. In all experiments ST and AST achieved better accuracy, dealing well with noisy data, describing the data from the earliest examples and maintaining a good execution time. The results indicate that ST and AST are well suited to data stream classification.
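
As an illustration of why a split rule that depends on the example count is conservative on small streams, the following minimal sketch implements the Hoeffding-bound test used by the original VFDT system. The bound is not stated in this abstract and is taken from the VFDT literature; the function names, the gain values and the confidence parameter are assumptions chosen for illustration, and the sketch does not reproduce the statistical heuristic used by ST or AST.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R): range of the split heuristic (1.0 for information
    gain with two classes); delta: allowed error probability;
    n: number of examples seen at the leaf.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Hypothetical helper: split only when the gap between the two best
    attributes exceeds the bound, which shrinks as n grows."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Illustrative values only (not results from the paper):
# the same gain gap of 0.05 is rejected early and accepted later.
early = should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=200)
late = should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=20000)
```

With these assumed values the leaf is not split after 200 examples but is split after 20,000, although the observed gain gap is identical. This is the dependence on the amount of data that the abstract refers to.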


Automatic StARMiner Tree; classification; data stream mining; incremental decision trees; StARMiner Tree; VFDT


An official publication of the Brazilian Computer Society Special Interest Group on Databases.