Text Mining Course Slides (Edited Revision), Content Summary:
[Figure legend: Not First Stories; Topic 1; Topic 2]

The First-Story Detection Task
To detect the first story that discusses a topic, for all topics.

First Story Detection
New event detection is an unsupervised learning task.
Detection may consist of:
  Retrospective detection: discovering previously unidentified events in an accumulated collection.
  Online detection: flagging the onset of new events from live news feeds.
There is no advance knowledge of new events, but unlabeled historical data is available as a contrast set.
The input to online detection is the stream of TDT stories in chronological order, simulating real-time incoming documents.
The output of online detection is a YES/NO decision per document.

Approach 1: KNN
Online processing of each incoming story.
Compute similarity to all previous stories:
  Cosine similarity
  Language model
  Prominent terms
  Extracted entities
If similarity is below threshold: new story.
If similarity is above threshold for previous story s: assign to topic of s.
Threshold can be trained on a training set.
Threshold is not topic-specific!

Approach 2: Single-Pass Clustering
Assign each incoming document to one of a set of topic clusters.
A topic cluster is represented by its centroid (vector average of members).
For each incoming story, compute similarity with the centroid.

Patterns in Event Distributions
News stories discussing the same event tend to be temporally proximate.
A time gap between bursts of topically similar stories is often an indication of different events:
  Different earthquakes
  Airplane accidents
A significant vocabulary shift and rapid changes in term frequency are typical of stories reporting a new event, including previously unseen proper nouns.
Events are typically reported in a relatively brief time window of 1-4 weeks.

Similar Events over Time

Approach 3: KNN + Time
Only consider documents in a (short) time window.
Compute similarity in a time-weighted fashion (m: number of documents in window, d_i: i-th document in window).
Time weighting significantly increases performance.
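As a rough illustration of Approaches 1 and 3 above, the sketch below flags first stories with a nearest-neighbour test over a sliding window of recent stories. The whitespace tokenizer, the window size, the threshold value, and the linear decay weight i/m are all assumptions made for this example; the weighting formula from the slides is not preserved in this text, so this is one plausible reading rather than the authors' method.

```python
import math
from collections import Counter, deque


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_first_stories(stories, threshold=0.3, window=100):
    """Return one YES/NO (True/False) first-story decision per story.

    stories: texts in chronological order.
    threshold, window: hypothetical values; in practice tuned on training data.
    """
    recent = deque(maxlen=window)  # only the last `window` stories are compared
    decisions = []
    for text in stories:
        vec = Counter(text.lower().split())
        m = len(recent)
        # Assumed linear decay: the i-th document (1 = oldest) gets weight i/m,
        # so older stories in the window count less.
        best = max(
            ((i / m) * cosine(vec, old) for i, old in enumerate(recent, start=1)),
            default=0.0,
        )
        decisions.append(best < threshold)  # below threshold => new event
        recent.append(vec)
    return decisions


stream = [
    "earthquake hits coastal city",
    "earthquake damage in coastal city grows",
    "airplane crash near airport",
]
print(detect_first_stories(stream))  # prints [True, False, True]
```

Setting `window` to the full history and dropping the `i / m` factor gives the plain KNN variant of Approach 1; as in the slides, a single threshold is shared by all topics.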
FSD Results

Discussion
Hard problem.
Becomes harder the more topics need to be tracked.
Second Story Detection is much easier than First Story Detection.
Example: retrospective detection of the first 9/11 story is easy, online detection is hard.

References
Online New Event Detection using Single-Pass Clustering, Papka, Allan (University of Massachusetts, 1997)
A Study on Retrospective and On-Line Event Detection, Yang, Pierce, Carbonell (Carnegie Mellon University, 1998)
UMass at TDT 2000, Allan, Lavrenko, Frey, Khandelwal (UMass, 2000)
Statistical Models for Tracking and Detection (Dragon Systems, 1999)

Summarization

What is a Summary?
Informative summary
  Purpose: replace the original document
  Example: executive summary
Indicative summary
  Purpose: support the decision "do I want to read the original document, yes/no?"
  Example: headline, scientific abstract

Why Automatic Summarization?
The algorithm for reading in many domains is:
  read summary
  decide whether relevant or not
  if relevant: read whole document
The summary is the gatekeeper for a large number of documents.
Information overload: often the summary is all that is read.
Human-generated summaries are expensive.

Summary Length (Reuters)
Goldstein et al. 1999

Summarization Algorithms
Keyword summaries
  Display most significant keywords
  Easy to do
  Hard to read, poor representation of content
Sentence extraction
  Extract key sentences
  Medium hard
  Summaries often don't read well
  Good representation of content
Natural language understanding / generation
  Build knowledge representation of text
  Generate sentences summarizing content
  Hard to do well
Something between the last two methods?

Sentence Extraction
Represent each sentence as a feature vector.
Compute a score based on the features.
Select the n highest-ranking sentences.
Present them in the order in which they occur in the text.
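The four sentence-extraction steps above can be sketched roughly as follows. The cue-phrase list, the stopword set, the three features, and the hand-set weights are illustrative assumptions, not values taken from the slides.

```python
import re
from collections import Counter

CUE_PHRASES = ("in summary", "in conclusion")  # hypothetical cue list
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "for", "that", "on", "it"}  # tiny illustrative stopword set


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def features(sentence, position, total, thematic):
    """Feature vector for one sentence (all three features are illustrative)."""
    lowered = sentence.lower()
    words = re.findall(r"[a-z']+", lowered)
    return [
        1.0 if any(p in lowered for p in CUE_PHRASES) else 0.0,  # cue phrase
        1.0 if position in (0, total - 1) else 0.0,  # first or last sentence
        sum(w in thematic for w in words) / max(len(words), 1),  # thematic density
    ]


def extract_summary(text, n=2, weights=(2.0, 1.0, 3.0)):
    """Score every sentence, keep the n best, return them in document order."""
    sentences = split_sentences(text)
    content = [w for w in re.findall(r"[a-z']+", text.lower())
               if w not in STOPWORDS]
    thematic = {w for w, _ in Counter(content).most_common(5)}
    scores = [
        sum(wt * f for wt, f in
            zip(weights, features(s, i, len(sentences), thematic)))
        for i, s in enumerate(sentences)
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # restore original order
```

For example, `extract_summary("Summarization selects key sentences. The weather was nice. In conclusion, summarization selects sentences.", n=2)` keeps the first and last sentences and drops the off-topic middle one. The weights here are fixed by hand; a trained classifier can replace them.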
Postprocessing to make the summary more readable/concise:
  Eliminate redundant sentences
  Anaphors/pronouns
  Delete subordinate clauses, parentheticals

Sentence Extraction: Example
SIGIR '95 paper on summarization by Kupiec, Pedersen, Chen
Trainable sentence extraction
Proposed algorithm is applied to its own description (the paper)

Feature Representation
Fixed-phrase feature
  Certain phrases indicate a summary, e.g. "in summary", "in conclusion", etc.
Paragraph feature
  Paragraph-initial/final sentences are more likely to be important.
Thematic word feature
  Repetition is an indicator of importance.
  Do any of the most frequent content words occur?
Uppercase word feature
  Uppercase often indicates named entities. (Taylor)
  Is an uppercase thematic word introduced?
Sentence length cut-off
  A summary sentence should be longer than 5 words.
  Summary sentences have a minimum length.

Training
Hand-label sentences in the training set (good/bad summary sentences).
Train a classifier to distinguish good/bad summary sentences.
Model used: Naïve Bayes.
Can rank sentences according to score and show the top n to the user.

Evaluation
Compare extracted sentences with sentences in abstracts.

Evaluation of features
Baseline (choose first n sentences): 24%
Overall performance (42-44%) is not very good.
However, there is more than one good summary.

DUC
DUC: government-sponsored bake-off to further progress in summarization and enable researchers to participate in large-scale experiments.

Multi-Document Summarization
Newsblaster (Columbia)

Query-Specific Summarization
So far, we've looked at generic summaries. A generic summ...
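The training step described above (hand-labelled sentences, a Naïve Bayes classifier, ranking by score) might look roughly like this minimal sketch. It assumes binary features, add-one (Laplace) smoothing, and log-odds as the ranking score; the toy training rows and the function names are hypothetical and are not taken from the Kupiec, Pedersen, Chen paper.

```python
import math


def train_naive_bayes(examples):
    """examples: list of (features, label); features is a tuple of 0/1 values,
    label is True for a hand-labelled good summary sentence.
    Returns, per label, the log prior and smoothed P(feature = 1 | label)."""
    k = len(examples[0][0])
    model = {}
    for label in (True, False):
        rows = [f for f, y in examples if y == label]
        prior = (len(rows) + 1) / (len(examples) + 2)
        p_on = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2) for j in range(k)]
        model[label] = (math.log(prior), p_on)
    return model


def log_odds(model, feats):
    """Log-odds that a sentence with these features belongs in the summary."""
    def log_joint(label):
        log_prior, p_on = model[label]
        return log_prior + sum(
            math.log(p if f else 1.0 - p) for f, p in zip(feats, p_on)
        )
    return log_joint(True) - log_joint(False)


def top_n(model, candidates, n):
    """candidates: list of (sentence, features). Return the n best sentences."""
    ranked = sorted(candidates, key=lambda c: log_odds(model, c[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:n]]


# Toy data; features = (fixed phrase, paragraph-initial, longer than 5 words)
train = [
    ((1, 1, 1), True), ((0, 1, 1), True),
    ((0, 0, 1), False), ((0, 0, 0), False), ((0, 1, 0), False),
]
model = train_naive_bayes(train)
print(top_n(model, [("short aside", (0, 0, 0)), ("key claim", (1, 1, 1))], 1))
```

Smoothing keeps every probability strictly between 0 and 1, so the logarithms are always defined even when a feature never fires for one class.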