Text Mining lecture slides (edited draft) — summary of contents:

The First-Story Detection Task

[Slide figure: a chronological stream of stories, each marked as the first story of Topic 1 or Topic 2, or as "not a first story".]

 Detect the first story that discusses a topic, for all topics.

First Story Detection

 New event detection is an unsupervised learning task.
 Detection may consist of discovering previously unidentified events in an accumulated collection (retrospective detection),
 or flagging the onset of new events from live news feeds in an online fashion.
 There is no advance knowledge of new events, but we have access to unlabeled historical data as a contrast set.
 The input to online detection is the stream of TDT stories in chronological order, simulating real-time incoming documents.
 The output of online detection is a YES/NO decision per document.

Approach 1: KNN

 Online processing of each incoming story.
 Compute similarity to all previous stories, e.g. via:
 – Cosine similarity
 – Language model
 – Prominent terms
 – Extracted entities
 If the similarity is below a threshold: new story.
 If the similarity is above the threshold for a previous story s: assign to the topic of s.
 The threshold can be trained on a training set.
 The threshold is not topic-specific!
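The KNN-style online detection above can be sketched in a few lines of Python. This is a minimal illustration assuming bag-of-words term-frequency vectors and a single global threshold; the function names and the threshold value are illustrative, not from the slides.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(ct * b[t] for t, ct in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def online_fsd(stories, threshold=0.2):
    """Process a chronological stream of story texts.

    Yields (story, is_first_story, topic_id): a story is flagged as a
    first story (YES) when no previous story is similar enough;
    otherwise it inherits the topic of its nearest neighbour (NO).
    """
    seen = []            # (vector, topic_id) for every previous story
    next_topic = 0
    for text in stories:
        vec = Counter(text.lower().split())
        best_sim, best_topic = 0.0, None
        for prev_vec, topic_id in seen:   # compare to ALL previous stories
            sim = cosine(vec, prev_vec)
            if sim > best_sim:
                best_sim, best_topic = sim, topic_id
        if best_sim < threshold:          # nothing similar: new event
            topic, is_first = next_topic, True
            next_topic += 1
        else:                             # assign to topic of nearest story
            topic, is_first = best_topic, False
        seen.append((vec, topic))
        yield text, is_first, topic
```

Note that, as the slides point out, the single threshold is not topic-specific; in practice it would be tuned on a labelled training stream.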
Approach 2: Single-Pass Clustering

 Assign each incoming document to one of a set of topic clusters.
 A topic cluster is represented by its centroid (the vector average of its members).
 For each incoming story, compute similarity with the centroids.

Patterns in Event Distributions

 News stories discussing the same event tend to be temporally proximate.
 A time gap between bursts of topically similar stories is often an indication of different events:
 – Different earthquakes
 – Airplane accidents
 A significant vocabulary shift and rapid changes in term frequency are typical of stories reporting a new event, including previously unseen proper nouns.
 Events are typically reported in a relatively brief time window of 1–4 weeks.

Similar Events over Time

[Slide figure: occurrences of similar events plotted over time.]

Approach 3: KNN + Time

 Only consider documents in a (short) time window.
 Compute similarity in a time-weighted fashion, e.g. by linearly discounting older documents:

   sim_time(x, d_i) = (i / m) · sim(x, d_i)

 where m is the number of documents in the window and d_i is the i-th document in the window (d_m being the most recent).
 Time weighting significantly increases performance.

FSD Results

Discussion

 A hard problem.
 Becomes harder the more topics need to be tracked.
 Second Story Detection is much easier than First Story Detection.
 Example: retrospective detection of the first 9/11 story is easy; online detection is hard.

References

 "On-line New Event Detection Using Single-Pass Clustering", Papka, Allan (University of Massachusetts, 1997)
 "A Study on Retrospective and On-line Event Detection", Yang, Pierce, Carbonell (Carnegie Mellon University, 1998)
 "UMass at TDT 2000", Allan, Lavrenko, Frey, Khandelwal (UMass, 2000)
 "Statistical Models for Tracking and Detection" (Dragon Systems, 1999)

Summarization

What is a Summary?

 Informative summary
 – Purpose: replace the original document.
 – Example: executive summary.
 Indicative summary
 – Purpose: support the decision "do I want to read the original document, yes/no?"
 – Example: headline, scientific abstract.

Why Automatic Summarization?
 The reading algorithm in many domains is:
 1. Read the summary.
 2. Decide whether it is relevant or not.
 3. If relevant: read the whole document.
 The summary is a gatekeeper for a large number of documents.
 Information overload.
 Often the summary is all that is read.
 Human-generated summaries are expensive.

Summary Length (Reuters)

[Slide figure: distribution of summary lengths in Reuters; Goldstein et al. 1999.]

Summarization Algorithms

 Keyword summaries
 – Display the most significant keywords.
 – Easy to do.
 – Hard to read; poor representation of content.
 Sentence extraction
 – Extract key sentences.
 – Medium hard.
 – Summaries often don't read well.
 – Good representation of content.
 Natural language understanding / generation
 – Build a knowledge representation of the text.
 – Generate sentences summarizing the content.
 – Hard to do well.
 Something between the last two methods?

Sentence Extraction

 Represent each sentence as a feature vector.
 Compute a score based on the features.
 Select the n highest-ranking sentences.
 Present them in the order in which they occur in the text.
 Post-process to make the summary more readable/concise:
 – Eliminate redundant sentences.
 – Resolve anaphors/pronouns.
 – Delete subordinate clauses and parentheticals.

Sentence Extraction: Example

 SIGIR '95 paper on summarization by Kupiec, Pedersen, Chen.
 Trainable sentence extraction.
 The proposed algorithm is applied to its own description (the paper).

Feature Representation

 Fixed-phrase feature
 – Certain phrases indicate a summary, e.g. "in summary", "in conclusion", etc.
 Paragraph feature
 – Paragraph-initial/final sentences are more likely to be important.
 Thematic word feature
 – Repetition is an indicator of importance.
 – Do any of the most frequent content words occur?
 Uppercase word feature
 – Uppercase often indicates named entities. (Taylor)
 – Is an uppercase thematic word introduced?
 Sentence length cutoff
 – Summary sentences have a minimum length: a summary sentence should be at least 5 words.
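The feature-vector scoring just described can be sketched as follows. This is a rough illustration, not the Kupiec/Pedersen/Chen implementation: the cue-phrase list, the exact feature definitions, and the weights are all assumptions made for the example.

```python
# Illustrative cue-phrase list (the paper's actual list is longer).
CUE_PHRASES = ("in summary", "in conclusion", "this paper")

def sentence_features(sent, doc_freq, position, n_sents):
    """Binary features for one sentence, loosely following the slide's
    fixed-phrase / paragraph / thematic / uppercase / length features."""
    words = sent.lower().split()
    return {
        "fixed_phrase": any(p in sent.lower() for p in CUE_PHRASES),
        # crude proxy for paragraph-initial/final position
        "paragraph": position == 0 or position == n_sents - 1,
        # do frequent content words occur? (threshold is an assumption)
        "thematic": sum(doc_freq.get(w, 0) for w in words) / max(len(words), 1) > 1.5,
        # uppercase words after the sentence-initial word suggest entities
        "uppercase": any(w[0].isupper() for w in sent.split()[1:]),
        "length_ok": len(words) >= 5,   # minimum-length cutoff
    }

def extract_summary(sentences, n=2, weights=None):
    """Score each sentence by a weighted feature sum; return the top-n
    sentences in their original document order."""
    weights = weights or {"fixed_phrase": 2.0, "paragraph": 1.0,
                          "thematic": 1.0, "uppercase": 0.5, "length_ok": 1.0}
    # document-level term frequencies for the thematic-word feature
    doc_freq = {}
    for s in sentences:
        for w in s.lower().split():
            doc_freq[w] = doc_freq.get(w, 0) + 1
    scored = []
    for i, s in enumerate(sentences):
        f = sentence_features(s, doc_freq, i, len(sentences))
        scored.append((sum(weights[k] for k, v in f.items() if v), i, s))
    top = sorted(scored, reverse=True)[:n]
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]
```

The post-processing steps from the slide (redundancy elimination, anaphora handling, clause deletion) would run on the returned sentences before display.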
Training

 Hand-label sentences in the training set (good/bad summary sentences).
 Train a classifier to distinguish good from bad summary sentences.
 Model used: Naïve Bayes.
 Sentences can then be ranked according to score, and the top n shown to the user.

Evaluation

 Compare the extracted sentences with the sentences in abstracts.

Evaluation of Features

 Baseline (choose the first n sentences): 24%.
 Overall performance (42–44%) is not very good.
 However, there is more than one good summary.

DUC

 DUC: a government-sponsored bake-off to further progress in summarization and enable researchers to participate in large-scale experiments.

Multi-Document Summarization

 Newsblaster (Columbia)

Query-Specific Summarization

 So far, we've looked at generic summaries.
 A generic summ…
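The training step above (a Naïve Bayes model over hand-labelled sentences) might look like this minimal sketch. It assumes binary features with Laplace smoothing; the feature names and toy examples are invented for illustration and are not the paper's features or data.

```python
import math

def train_nb(examples):
    """Train Naïve Bayes from (features: dict[str, bool], label: bool)
    pairs, where label=True marks a good summary sentence."""
    counts = {True: 1, False: 1}          # Laplace-smoothed class counts
    feat_counts = {True: {}, False: {}}   # per-class counts of active features
    for feats, label in examples:
        counts[label] += 1
        for name, value in feats.items():
            if value:
                feat_counts[label][name] = feat_counts[label].get(name, 0) + 1
    return counts, feat_counts

def score(feats, model):
    """Log-odds that a sentence belongs in the summary; higher is better."""
    counts, feat_counts = model
    total = counts[True] + counts[False]
    log_odds = math.log(counts[True] / total) - math.log(counts[False] / total)
    for name, value in feats.items():
        if value:
            # smoothed per-class probability of the feature being active
            p_t = (feat_counts[True].get(name, 0) + 1) / (counts[True] + 2)
            p_f = (feat_counts[False].get(name, 0) + 1) / (counts[False] + 2)
            log_odds += math.log(p_t) - math.log(p_f)
    return log_odds
```

As on the slide, sentences are ranked by this score and the top n are shown to the user; evaluation then compares the extracted sentences against the abstract's sentences.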