Lesson 2: Data Preprocessing Techniques. Summary of contents:

Data Reduction Strategies
 Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
 Data cube aggregation
 Dimensionality reduction: remove unimportant attributes
 Data compression
 Numerosity reduction: fit data into models
 Discretization and concept hierarchy generation

Data Cube Aggregation
 The lowest level of a data cube holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse
 Multiple levels of aggregation in data cubes further reduce the size of the data to deal with
 Reference appropriate levels: use the smallest representation that is sufficient to solve the task
 Queries about aggregated information should be answered from the data cube when possible

Dimensionality Reduction
 Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the different classes, given the values of those features, is as close as possible to the original distribution given the values of all features
 Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
 Heuristic methods (needed because the number of attribute subsets is exponential):
 Stepwise forward selection
 Stepwise backward elimination
 Combining forward selection and backward elimination
 Decision tree induction

Example of Decision Tree Induction
[Figure: a decision tree induced from the initial attribute set {A1, A2, A3, A4, A5, A6}; its internal nodes test A4, A1, and A6]
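The stepwise forward selection heuristic can be sketched as a greedy loop: start from the empty attribute set and repeatedly add the single feature that most improves an evaluation score. This is a minimal sketch; the `relevance` values and the size penalty inside `score` are made-up illustration data, not part of the lecture.

```python
def forward_selection(features, score):
    """Stepwise forward selection: greedily add the one feature that most
    improves the score; stop when no single addition helps."""
    selected, remaining = [], list(features)
    best_score = score(selected)
    while remaining:
        # Evaluate adding each remaining feature and keep the best candidate
        new_score, best_f = max((score(selected + [f]), f) for f in remaining)
        if new_score <= best_score:
            break  # no single feature improves the score; stop
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = new_score
    return selected

# Hypothetical per-attribute relevance scores, with a penalty on subset size
relevance = {"A1": 0.5, "A2": 0.0, "A4": 0.8, "A6": 0.3}
score = lambda s: sum(relevance[f] for f in s) - 0.05 * len(s) ** 2
print(forward_selection(relevance, score))  # ['A4', 'A1', 'A6']
```

With these toy scores the greedy loop picks A4, then A1, then A6, and stops before adding the irrelevant A2, mirroring the reduced attribute set in the decision-tree example. Backward elimination is the same loop run in reverse, starting from the full set and dropping the least useful feature each round.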
The tree's leaves separate Class 1 from Class 2; the reduced attribute set is {A1, A4, A6}.

Data Compression
 String compression
 There are extensive theories and well-tuned algorithms
 Typically lossless
 But only limited manipulation is possible without expansion
 Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
 Time sequences are not audio: they are typically short and vary slowly with time
[Figure: original data can be compressed losslessly and restored exactly, or compressed lossily and only approximated]

Wavelet Transformation
 Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis
 Compressed approximation: store only a small fraction of the strongest wavelet coefficients
 Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
 Method:
 The length, L, must be an integer power of 2 (pad with 0s when necessary)
 Each transform uses two functions: smoothing and differencing
 They are applied to pairs of data points, producing two data sets of length L/2
 The two functions are applied recursively until the desired length is reached
[Figure: the Haar-2 and Daubechies-4 wavelet families]

Principal Component Analysis
 Given N data vectors in k dimensions, find c ≤ k orthogonal vectors that can best be used to represent the data
 The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
 Each data vector is a linear combination of the c principal component vectors
 Works for numeric data only
 Used when the number of dimensions is large
[Figure: data in the (X1, X2) plane projected onto principal axes Y1 and Y2]

Numerosity Reduction
 Parametric methods
 Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
 Log-linear models: obtain the value at a point in m-dimensional space as the product over the appropriate marginal subspaces
 Non-parametric methods
 Do not assume models
 Major families: histograms,
clustering, and sampling

Regression and Log-Linear Models
 Linear regression: data are modeled to fit a straight line
 Often uses the least-squares method to fit the line
 Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
 Log-linear model: approximates discrete multidimensional probability distributions
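The least-squares line fit used by linear regression can be sketched with NumPy: build a design matrix with a column of ones for the intercept and solve for the slope and intercept. The toy x/y values below are made up for illustration and roughly follow y = 2x + 1.

```python
import numpy as np

# Hypothetical toy data, roughly y = 2x + 1 with small noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

# Least-squares fit of y ~ w*x + b: design matrix [x, 1], then solve
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(w, 2), round(b, 2))
```

Once the fit is done, only the two parameters w and b need to be stored instead of the raw points, which is exactly the parametric numerosity-reduction idea described above.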