Indexing Implementation and Indexing Models

ations  less so for mercial usage where query loads are lighter Intelligent Information Retrieval 24 Retrieval From Indexes  Given the large indexes in IR applications, searching for keys in the dictionaries bees a dominant cost  Two main choices for dictionary data structures: Hashtables or Trees  Using Hashing  requires the derivation of a hash function mapping terms to locations  may require collision detection and resolution for nonunique hash values  Using Trees  Binary search trees  nice properties, easy to implement, and effective  enhancements such as B+ trees can improve search effectiveness  but, requires the storage of keys in each internal node Hashtables  Each vocabulary term is hashed to an integer  (We assume you’ve seen hashtables before)  Pros:  Lookup is faster than for a tree: O(1)  Cons:  No easy way to find minor variants:  judgment/judgement  No prefix search [tolerant retrieval]  If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything Sec. 25 Trees  Simplest: binary tree  More usual: Btrees  Trees require a standard ordering of characters and hence strings … but we typically have one  Pros:  Solves the prefix problem (., terms starting with hyp)  Cons:  Slower: O(log M) [and this requires balanced tree]  Rebalancing binary trees is expensive  But Btrees mitigate the rebalancing problem Sec. 26 Root am nz ahu hym nsh siz Tree: binary tree Sec. 27 Tree: Btree  Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, ., [2,4]. ahu hym nz Sec. 28 Intelligent Information Retrieval 29 Recall: Steps in Basic Automatic Indexing  Parse documents to recognize structure  Scan for word tokens  Stopword removal  Stem words  Weight words Intelligent Information Retrieval 30 Indexing Models (aka “Term Weighting”)  Basic issue: which terms should be used to index a document, and how much should it count?  Some approaches  binary weights  Terms either appear or they don’t。 no frequency information used.  term frequency  Either raw term counts or (more often) term counts divided by total frequency of the term across all documents  (inverse document frequency model)  Term discrimination model  Signaltonoise ratio (based on information theory)  Probabilistic term weights Intelligent Information Retrieval 31 Binary Weights  Only the presence (1) or absence (0) of a term is included in the vector d o c s t1 t2 t3D1 1 0 1D2 1 0 0D3 0 1 1D4 1 0 0D5 1 1 1D6 1 1 0D7 0 1 0D8 0 1 0D9 0 0 1D 1 0 0 1 1D 1 1 1 0 1This representation can be particularly useful, since the documents (and the query) can be viewed as simple bit strings. This allows for query operations be performed using logical bit operations. Intelligent Information Retrieval 32 Binary Weights: Matching of Documents amp。 Queries d o c s t1 t2 t3 R a n k =Q . D iD1 1 0 1 2D2 1 0 0 1D3 0 1 1 2D4 1 0 0 1D5 1 1 1 3D6 1 1 0 2D7 0 1 0 1D8 0 1 0 1D9 0 0 1 1D 1 0 0 1 1 2D 1 1 1 0 1 2Q 1 1 1q1 q2 q3D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 t2 t3 t1  In the case of binary weights, matching between documents and queries can be seen as the size of the intersection of two sets (of terms): |Q  D|. This in turn can be used to rank the relevance of documents to a query. Intelligent Information Retrieval 33 Beyond Binary Weight d o c s t1 t2 t3 R a n k =Q . 
Beyond Binary Weights
- More generally, the similarity between the query and the document can be seen as the dot product of the two vectors, Q·D (this is also called simple matching).
- Note that if both Q and D are binary, this is the same as |Q ∩ D|.
- Given two vectors X = (x1, x2, …, xn) and Y = (y1, y2, …, yn), simple matching measures the similarity between X and Y as their dot product:
  sim(X, Y) = X · Y = Σ_i x_i · y_i

  docs   t1  t2  t3   Rank = Q·Di
  D1      1   0   1    4
  D2      1   0   0    1
  D3      0   1   1    5
  D4      1   0   0    1
  D5      1   1   1    6
  D6      1   1   0    3
  D7      0   1   0    2
  D8      0   1   0    2
  D9      0   0   1    3
  D10     0   1   1    5
  D11     1   0   1    4
  Q       1   2   3

Raw Term Weights
- The frequency of occurrence of the term in each document is included in the vector.

  docs   t1  t2  t3   RSV = Q·Di
  D1      2   0   3    11
  D2      1   0   0     1
  D3      0   4   7    29
  D4      3   0   0     3
  D5      1   6   3    22
  D6      3   5   0    13
  D7      0   8   0    16
  D8      0  10   0    20
  D9      0   0   1     3
  D10     0   3   5    21
  D11     4   0   1     7
  Q       1   2   3

- Now the notion of simple matching (dot product) incorporates the term weights from both the query and the documents. Using raw term weights provides the ability to better distinguish among retrieved documents.
- Note: although "term frequency" is commonly used to mean the raw occurrence count, technically it implies that the raw count is divided by the document length (the total number of term occurrences in the document).

Term Weights: TF
- More frequent terms in a document are more important, i.e. more indicative of the topic.
  f_ij = frequency of term i in document j
- May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
  tf_ij = f_ij / max_i{f_ij}
- Or use sublinear tf scaling: tf_ij = 1 + log f_ij

Normalized Similarity Measures
- With or without normalized weights, it is possible to incorporate normalization into various similarity measures.
- Example (Vector Space Model):
  - in simple matching, the dot product of two vectors measures the similarity of these vectors
  - the normalization can be achieved by dividing the dot product by the product of the norms of the two vectors
  - given a vector X = (x1, …, xn), the norm of X is ||X|| = sqrt(Σ_i x_i^2)
  - the similarity of vectors X and Y is then sim(X, Y) = (X · Y) / (||X|| · ||Y||), i.e. the cosine of the angle between them (see the sketch below)
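Below is a minimal sketch, again not from the original slides, of simple matching versus the normalized (cosine) similarity just described, applied to the raw-term-weight example. The vectors and query mirror the table above; everything else is plain Python.

```python
import math

# Raw term-weight vectors over (t1, t2, t3), mirroring the example table.
docs = {
    "D1": [2, 0, 3], "D2": [1, 0, 0], "D3": [0, 4, 7], "D4": [3, 0, 0],
    "D5": [1, 6, 3], "D6": [3, 5, 0], "D7": [0, 8, 0], "D8": [0, 10, 0],
    "D9": [0, 0, 1], "D10": [0, 3, 5], "D11": [4, 0, 1],
}
query = [1, 2, 3]

def dot(x, y):
    """Simple matching: sim(X, Y) = sum_i x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    """Euclidean norm ||X|| = sqrt(sum_i x_i^2)."""
    return math.sqrt(sum(xi * xi for xi in x))

def cosine(x, y):
    """Dot product divided by the product of the norms."""
    return dot(x, y) / (norm(x) * norm(y))

# Compare the raw RSV (dot product) with the length-normalized score.
for name, vec in docs.items():
    print(name, dot(vec, query), round(cosine(vec, query), 3))
```

Note how D7 = (0, 8, 0) and D8 = (0, 10, 0) receive the same cosine score (about 0.535) even though D8's raw RSV is higher (20 vs. 16): the normalization removes the effect of vector length, which is exactly the point of the normalized measure.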