Lesson 2: Data Preprocessing Techniques. Content summary:
...analytical results

Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction: remove unimportant attributes
  - Data compression
  - Numerosity reduction: fit data into models
  - Discretization and concept hierarchy generation

Data Cube Aggregation
  - The lowest level of a data cube: the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse
  - Multiple levels of aggregation in data cubes
      - Further reduce the size of the data to deal with
  - Reference appropriate levels
      - Use the smallest representation that is enough to solve the task
  - Queries regarding aggregated information should be answered using the data cube, when possible

Dimensionality Reduction
  - Feature selection (i.e., attribute subset selection):
      - Select a minimum set of features such that the probability distribution of the different classes, given the values for those features, is as close as possible to the original distribution given the values of all features
      - Reduces the number of patterns, making them easier to understand
  - Heuristic methods (due to the exponential number of choices):
      - stepwise forward selection
      - stepwise backward elimination
      - combining forward selection and backward elimination
      - decision-tree induction

Example of Decision Tree Induction
  Initial attribute set: {A1, A2, A3, A4, A5, A6}
  [Figure: decision tree with attribute tests A4?, A1?, A6? and leaves Class 1, Class 2, Class 1, Class 2]
  Reduced attribute set: {A1, A4, A6}

Data Compression
  - String compression
      - There are extensive theories and well-tuned algorithms
      - Typically lossless
      - But only limited manipulation is possible without expansion
  - Audio/video compression
      - Typically lossy compression, with progressive refinement
      - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
  - Time sequence is not audio
      - Typically short and varies slowly with time

  [Figure: Data Compression. Original Data, Compressed Data (lossless), Original Data Approximated]

Wavelet Transformation
  - Discrete wavelet transform (DWT): linear signal processing, multiresolution analysis
  - Compressed approximation: store only a small fraction of the strongest wavelet coefficients
  - Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
  - Method:
      - The length, L, must be an integer power of 2 (pad with 0s when necessary)
      - Each transform has 2 functions: smoothing and difference
      - They are applied to pairs of data, resulting in two sets of data of length L/2
      - The two functions are applied recursively until the desired length is reached

  [Figure: Haar-2 and Daubechies-4 wavelets]

Principal Component Analysis
  - Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
      - The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
  - Each data vector is a linear combination of the c principal component vectors
  - Works for numeric data only
  - Used when the number of dimensions is large

  [Figure: Principal Component Analysis. Original axes X1, X2; principal components Y1, Y2]

Numerosity Reduction
  - Parametric methods
      - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
      - Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces
  - Nonparametric methods
      - Do not assume models
      - Major families: histograms, clustering, sampling

Regression and Log-Linear Models
  - Linear regression: data are modeled to fit a straight line
      - Often uses the least-squares method to fit the line
  - Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
  - Log-linear model: approximates discrete multidimensional probability distributions
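The linear regression item above, together with the parametric idea of storing only the model parameters, can be sketched in a few lines. This is a minimal illustration and not taken from the lecture: the function names `fit_line` and `predict` and the sample data are hypothetical. It assumes a single numeric predictor and uses the standard closed-form least-squares estimates for a line y = a + b*x.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns the pair (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


def predict(params, x):
    """Reconstruct an approximate y value from the stored parameters only."""
    a, b = params
    return a + b * x


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

# Numerosity reduction: keep the two parameters instead of the five points.
params = fit_line(xs, ys)
print(params)
print(predict(params, 6))
```

After fitting, only `params` needs to be kept; values are re-derived on demand with `predict`. With real, noisy data the reconstruction is approximate, which is why the slide suggests keeping possible outliers alongside the parameters.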