Web Mining: Content Summary

Source: Bing Liu (2020), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data
More implementation issues

• URL canonicalization
  – Many syntactically different URLs (for example, variants that differ in the case of the host name, an explicit default port, relative path segments such as "..", or a trailing slash) are really equivalent to a single canonical form
  – To avoid duplication, the crawler must transform all URLs into canonical form
  – The definition of "canonical" is arbitrary, e.g.:
    • Could always include the port
    • Or only include the port when it is not the default :80

More on canonical URLs

• Some transformations are trivial, for example lowercasing the scheme and host name or removing a default port number
• Other transformations require heuristic assumptions about the intentions of the author or the configuration of the Web server:
  1. Removing the default file name
     – This is reasonable in general, but would be wrong when the server's default file name happens to be something other than the one assumed
  2. Appending a trailing slash to a directory name
     – This is correct in this case, but how can we be sure in general that there isn't a file with that name in the root directory?

Convert URLs to canonical forms
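Below is a minimal Python sketch of such a canonicalization routine. The specific rules chosen here (lowercasing the scheme and host, dropping a default port, resolving "." and ".." segments, removing the fragment) are one arbitrary set of illustrative choices, not the book's exact specification:

    from urllib.parse import urljoin, urlsplit, urlunsplit

    DEFAULT_PORTS = {"http": 80, "https": 443}

    def canonicalize(url: str) -> str:
        """Reduce an absolute http(s) URL to one (arbitrary) canonical form."""
        parts = urlsplit(url)
        scheme = parts.scheme.lower()
        host = (parts.hostname or "").lower()   # host names are case-insensitive
        port = parts.port
        # Keep the port only when it is not the scheme's default (e.g. drop :80).
        netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
        # Resolve relative segments such as "." and ".." against the site root.
        path = urlsplit(urljoin(f"{scheme}://{netloc}/", parts.path)).path or "/"
        # Drop the fragment: it is never sent to the server.
        return urlunsplit((scheme, netloc, path, parts.query, ""))

    print(canonicalize("HTTP://WWW.Example.COM:80/a/../b#sec2"))
    # -> http://www.example.com/b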
More implementation issues

• Spider traps
  – Misleading sites: an indefinite number of pages dynamically generated by CGI scripts
  – Paths of arbitrary depth created using soft directory links and path-rewriting features in the HTTP server
  – Only heuristic defensive measures are available (a sketch appears at the end of this section):
    • Check URL length; assume a spider trap above some threshold, for example 128 characters
    • Watch for sites with a very large number of URLs
    • Eliminate URLs with non-textual data types
    • May disable crawling of dynamic pages, if they can be detected

More implementation issues

• Page repository
  – Naïve: store each page as a separate file
    • Can map each URL to a unique filename using a hashing function, e.g. MD5
    • This generates a huge number of files, which is inefficient from the storage perspective
  – Better: combine many pages into a single large file, using some XML markup to separate and identify them (a sketch appears at the end of this section)
    • Must map each URL to {filename, page_id}
  – Database options:
    • Any RDBMS: large overhead
    • Lightweight, embedded databases such as Berkeley DB

Concurrency

• A crawler incurs several delays:
  – Resolving the host name in the URL to an IP address using DNS
  – Connecting a socket to the server and sending the request
  – Receiving the requested page in response
• Solution: overlap these delays by fetching many pages concurrently

Architecture of a concurrent crawler

(The original slide shows an architecture diagram; the following slide summarizes it.)

Concurrent crawlers

• Can use multiprocessing or multithreading
• Each process or thread works like a sequential crawler, except that they share data structures: the frontier and the repository
• Shared data structures must be synchronized (locked for concurrent writes); see the threaded sketch at the end of this section
• Speedups by a factor of 5–10 are easy to achieve this way

Universal crawlers

• Support universal search engines
• Large-scale
• The huge cost (network bandwidth) of a crawl is amortized over many queries from users
• Incremental updates to the existing index and other data repositories

Large-scale universal crawlers

• Two major issues:
  1. Performance
     • Need to scale up to billions of pages
  2. Policy
     • Need to trade off coverage, freshness, and bias (e.g. toward "important" pages)

Large-scale crawlers: scalability

• Need to minimize the overhead of DNS lookups
• Need to optimize utilization of network bandwidth and disk throughput (I/O is the bottleneck)
• Use asynchronous sockets (see the asyncio sketch at the end of this section):
  – Multiprocessing or multithreading do not scale up to billions of pages
  – Non-blocking: hundreds of network connections open simultaneously
  – Polling sockets to monitor completion of network transfers
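As noted under "Spider traps" above, only heuristic defenses are possible. A minimal Python sketch of those heuristics follows; aside from the 128-character length threshold mentioned on the slide, the per-host cap and the extension list are illustrative assumptions:

    from collections import defaultdict
    from urllib.parse import urlsplit

    MAX_URL_LEN = 128                     # length threshold from the slide
    MAX_URLS_PER_HOST = 10_000            # illustrative cap, an assumption
    NONTEXT_EXTS = (".jpg", ".png", ".gif", ".zip", ".exe")   # illustrative list

    urls_seen = defaultdict(int)          # URLs encountered so far, per host

    def probably_trap(url: str, skip_dynamic: bool = False) -> bool:
        """Heuristic spider-trap filters; True means the URL looks suspect."""
        if len(url) > MAX_URL_LEN:                        # very long URL
            return True
        parts = urlsplit(url)
        urls_seen[parts.netloc] += 1
        if urls_seen[parts.netloc] > MAX_URLS_PER_HOST:   # huge number of URLs on one site
            return True
        if parts.path.lower().endswith(NONTEXT_EXTS):     # non-textual data type
            return True
        if skip_dynamic and (parts.query or "/cgi-bin/" in parts.path):
            return True                                   # dynamically generated page
        return False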
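The page-repository slide contrasts one-file-per-page naming via a hash such as MD5 with packing many pages into one large file. A sketch of both ideas follows, assuming a made-up <page url="..."> record format; the index mapping URL -> (filename, offset) plays the role of the {filename, page_id} mapping:

    import hashlib
    import os

    def url_to_filename(url: str) -> str:
        # Naive scheme: one file per page, named by the MD5 hash of its URL.
        return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"

    def append_page(repo, url: str, html: str) -> int:
        """Append one page record to a large repository file; return its offset."""
        repo.seek(0, os.SEEK_END)
        offset = repo.tell()
        # A real implementation would escape the URL and the page content.
        repo.write(f'<page url="{url}">\n{html}\n</page>\n'.encode("utf-8"))
        return offset

    index = {}          # URL -> (filename, offset), i.e. the {filename, page_id} map
    with open("repo-000.xml", "ab") as repo:
        index["http://example.com/"] = (
            "repo-000.xml",
            append_page(repo, "http://example.com/", "<html>...</html>"),
        )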
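The concurrent-crawler slides describe threads that each behave like a sequential crawler while sharing a synchronized frontier and repository. A minimal threaded sketch, assuming Python's thread-safe queue.Queue as the frontier and an explicit lock around the shared repository (the seed URL and thread count are arbitrary):

    import queue
    import threading
    import urllib.request

    frontier = queue.Queue()      # thread-safe: puts and gets are internally locked
    repository = {}               # shared structure, guarded by an explicit lock
    repo_lock = threading.Lock()

    def worker():
        while True:
            url = frontier.get()  # blocks until a URL is available
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read()
                with repo_lock:   # synchronize concurrent writes
                    repository[url] = page
                # A real crawler would parse `page` here and frontier.put() new URLs.
            except OSError:
                pass              # fetch failed; a real crawler would log this
            finally:
                frontier.task_done()

    frontier.put("http://example.com/")   # hypothetical seed
    for _ in range(8):                    # 8 threads, an arbitrary degree of concurrency
        threading.Thread(target=worker, daemon=True).start()
    frontier.join()                       # wait until the frontier is drained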
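Finally, the scalability slide argues for asynchronous, non-blocking sockets rather than one thread per connection. A minimal sketch using Python's asyncio event loop, which polls many open sockets from a single thread; the raw HTTP/1.0 request is a simplification, and the DNS-lookup caching the slide calls for is not shown:

    import asyncio
    from urllib.parse import urlsplit

    async def fetch(url: str) -> bytes:
        """Fetch one page over a non-blocking socket (plain HTTP/1.0, no TLS)."""
        parts = urlsplit(url)
        reader, writer = await asyncio.open_connection(parts.hostname, parts.port or 80)
        writer.write(f"GET {parts.path or '/'} HTTP/1.0\r\n"
                     f"Host: {parts.hostname}\r\n\r\n".encode())
        await writer.drain()
        body = await reader.read()   # the event loop polls the socket; no thread blocks
        writer.close()
        await writer.wait_closed()
        return body

    async def crawl(urls):
        # Hundreds of connections can be open at once; a single-threaded event loop
        # multiplexes them, which scales further than one thread per transfer.
        return await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)

    pages = asyncio.run(crawl(["http://example.com/"]))   # hypothetical seed list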