Scale-up x Scale-out: A Case Study Using Nutch/Lucene (excerpt)
In the reduce phase, each reducer task collects all the pairs for a given key, thus producing a single index table for that key. Once all the keys are processed, we have the complete index for the entire data set.

In most search applications, query processing represents the vast majority of the computation effort. When performing a query, a set of index terms is presented to a query engine, which then retrieves the documents that best match that set of terms. The overall architecture of the Nutch/Lucene parallel query engine is shown in Figure 3. The query engine consists of one or more frontends and one or more backends. Each backend is associated with a segment of the complete data set. The driver represents external users, and it is also the point at which the performance of the query engine is measured, in terms of queries per second (qps).

Figure 3: Nutch/Lucene query. (The driver submits queries to the frontends and measures throughput; each frontend distributes queries to the backends, each of which holds one data segment.)

A query operation works as follows. The driver submits a particular query (a set of index terms) to one of the frontends. The frontend then distributes the query to all the backends. Each backend is responsible for performing the query against its own data segment and returning a list of the top documents (typically 10) that best match the query. Each document returned is associated with a score, which quantifies how good that match is. The frontend collects the responses from all the backends to produce a single list of the top documents (typically the 10 overall best matches). Once the frontend has that list, it contacts the backends to retrieve snippets of text around the index terms. Only snippets for the overall top documents are retrieved. The frontend contacts the backends one at a time, retrieving each snippet from the backend that held the corresponding document in its data segment.
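This scatter/gather flow can be summarized in code. The sketch below is a minimal illustration under assumed interfaces; the names (Backend, ScoredDoc, Frontend, topDocs, snippet) are hypothetical stand-ins, not Nutch or Lucene APIs. Each backend returns its per-segment top-k scored documents, the frontend merges them into a global top-k, and snippets are then fetched one backend at a time.

    import java.util.*;
    import java.util.concurrent.*;

    // Minimal sketch of the frontend scatter/gather flow described above.
    // Backend, ScoredDoc, and Frontend are hypothetical stand-ins, not Nutch/Lucene classes.
    interface Backend {
        List<ScoredDoc> topDocs(String query, int k);   // search this backend's data segment
        String snippet(String query, ScoredDoc doc);    // text around the index terms
    }

    record ScoredDoc(String docId, float score, Backend owner) {}

    class Frontend {
        private final List<Backend> backends;
        private final ExecutorService pool = Executors.newCachedThreadPool();

        Frontend(List<Backend> backends) { this.backends = backends; }

        // Returns the overall top-k documents, with snippets, for one query.
        List<String> query(String terms, int k) throws Exception {
            // Scatter: send the query to every backend in parallel.
            List<Future<List<ScoredDoc>>> futures = new ArrayList<>();
            for (Backend b : backends) {
                futures.add(pool.submit(() -> b.topDocs(terms, k)));
            }
            // Gather: merge the per-segment lists into a single global top-k by score.
            PriorityQueue<ScoredDoc> best =
                new PriorityQueue<>(Comparator.comparingDouble(ScoredDoc::score));
            for (Future<List<ScoredDoc>> f : futures) {
                for (ScoredDoc d : f.get()) {
                    best.offer(d);
                    if (best.size() > k) best.poll();   // keep only the k highest scores
                }
            }
            List<ScoredDoc> topK = new ArrayList<>(best);
            topK.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
            // Snippets: contact backends one at a time, only for the overall top documents,
            // asking the backend whose segment holds each document.
            List<String> results = new ArrayList<>();
            for (ScoredDoc d : topK) {
                results.add(d.docId() + ": " + d.owner().snippet(terms, d));
            }
            return results;
        }
    }

In the experiments reported below there is a single frontend and a single backend, so the merge step is trivial; with many backends it is this merge that keeps the returned list to the overall best matches.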
4. Experimental results

We present three different kinds of experimental results. First, we present performance counter data that characterizes how the application behaves at the instruction level on our JS21 PowerPC blades. We show that the behavior of the Nutch/Lucene query is similar to that of other standard benchmarks in the SPECcpu suite. We then report performance results from runs on our reference POWER5 p5 575 SMP machine, comparing a pure scale-up configuration with a scale-out-in-a-box configuration. Finally, we present scalability results from runs on the BladeCenter scale-out cluster.

4.1. Performance counter data

Using the PowerPC hardware performance counters and tools from the University of Toronto [1], we performed a stall breakdown analysis for the Nutch/Lucene query operation. We ran a configuration with one frontend and one backend, each on a separate JS21 blade. The backend operated on 10 GB of search data. We collected performance data only for the backend, since that is the more computationally intensive component. Performance data was collected for a period of 120 seconds during steady-state operation of the backend, and reported second by second.

Figure 4 is a plot of the number of instructions completed per second during the measurement interval, by all four PowerPC processors. We observe that the number of instructions completed stays mostly in the range of 5 to 7 billion instructions per second. Since there are four (4) PowerPC 970 processors running at 2.5 GHz, and the PowerPC 970 can complete up to 5 instructions per cycle, the maximum completion rate of the JS21 is 50 billion instructions per second.

Figure 5 is a plot of clocks per instruction (CPI) as a function of time. It is obtained by dividing, for each second, the number of cycles executed by all processors (10 billion) by the number of instructions executed during that second (from Figure 4). The CPI stays in the range of roughly 1.4 to 2 (10 billion cycles divided by 5-7 billion instructions). We note that most of the CPI values for SPECcpu 2000 on the PowerPC 970, as reported in the literature [1], are in the range of 1 to 2. We also note that the best possible CPI for the PowerPC 970 is 0.2 (five instructions completed per cycle).

Figure 4: Instructions executed over time. Figure 5: CPI over time.

From the CPI data, we can conclude that (1) the CPI for query is very far from peak, but (2) it is within what would be expected from previous experience with SPEC. More detailed information can be obtained using a stall breakdown analysis, as shown in Figure 6. The figure is a plot of the breakdown for each 1-second period during the analysis interval. (We note that in each second there are a total of 10 billion processor cycles: 4 processors at 2.5 GHz.) The order of the components in the plot corresponds to the order in the legend, from bottom to top.

Figure 7 shows the average over time of the data in Figure 6, separated into cycles in user mode, kernel mode, and total (all). We observe that instructions complete on only 20% of the cycles. (Multiple instructions can complete per cycle; instructions in the PowerPC 970 complete in bundles.) From the average number of instructions executed per second (10 billion cycles per second divided by the measured CPI, or approximately 6 billion instructions per second), we conclude that the average bundle size is approximately 3 instructions (out of a maximum of 5).

Figure 6: Stall breakdown for query. Figure 7: Average stall breakdown for query.

Another metric we can derive is the non-stall CPI for query, computed by dividing the number of non-stall (completion) cycles by the number of instructions. That number comes out at approximately 0.33 (2 billion completion cycles divided by roughly 6 billion instructions per second), which again is very similar to the non-stall CPI for SPECcpu [1]. Another important observation is that for a significant number of cycles…
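The derived quantities in this subsection follow from simple arithmetic on the counter data, reproduced in the sketch below. The class name QueryCpiMetrics and the per-second sample values (6 billion instructions per second, 20% completion cycles) are illustrative assumptions drawn from the ranges quoted above, not the measured traces of Figures 4-7.

    // Sketch of the arithmetic behind the derived metrics in this subsection
    // (CPI, bundle size, non-stall CPI). Sample values are illustrative assumptions.
    public class QueryCpiMetrics {
        static final double CYCLES_PER_SECOND = 4 * 2.5e9; // 4 PowerPC 970 processors at 2.5 GHz
        static final int MAX_BUNDLE = 5;                    // up to 5 instructions complete per cycle

        public static void main(String[] args) {
            double instructionsPerSecond = 6.0e9;  // assumed, within the observed 5-7 billion range
            double completionFraction = 0.20;      // fraction of cycles on which instructions complete

            double cpi = CYCLES_PER_SECOND / instructionsPerSecond;            // ~1.7
            double completionCycles = completionFraction * CYCLES_PER_SECOND;  // 2 billion per second
            double bundleSize = instructionsPerSecond / completionCycles;      // ~3 of a maximum of 5
            double nonStallCpi = completionCycles / instructionsPerSecond;     // ~0.33
            double bestPossibleCpi = 1.0 / MAX_BUNDLE;                         // 0.2

            System.out.printf("CPI = %.2f (best possible %.2f)%n", cpi, bestPossibleCpi);
            System.out.printf("Average bundle size = %.1f instructions%n", bundleSize);
            System.out.printf("Non-stall CPI = %.2f%n", nonStallCpi);
        }
    }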