Index build speed comparison: Lucene 2.4 vs. Lucene 2.0


LuceneSenna

Lucene 2.0使Lucene 2.0使Lucene

The versions used were Lucene 2.4.1 (released 2009/03) and Lucene 2.0.0 (released 2006/05).


Data: Japanese Wikipedia, 200,000 articles

Fields:

title: Store.YES, Index.ANALYZED

body: Store.COMPRESS, Index.ANALYZED


Analyzer: CJKAnalyzer (bi-gram)

IndexWriter setting: setMaxBufferedDocs(10000)
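
To make the "bi-gram" setting concrete, here is a minimal sketch of overlapping two-character tokenization. It is a simplified stand-in written for illustration, not the actual CJKAnalyzer implementation; the class name BigramSketch and the method bigrams are hypothetical, and the sketch assumes the input is a single run of CJK characters with no surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a simplified model of bi-gram tokenization.
// Assumes the input is one run of CJK characters (no Latin text, no surrogate pairs).
class BigramSketch {
    // Returns every overlapping pair of adjacent characters in the input.
    // Inputs shorter than two characters yield no tokens in this sketch.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [全文, 文検, 検索]
        System.out.println(bigrams("全文検索"));
    }
}
```

Because every adjacent character pair becomes a token, a bi-gram index holds roughly one token per character of body text, which is part of why the analyzer choice matters for both build time and index size.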



Version                          Lucene 2.4.1   Lucene 2.0.0
Index build time                 377 s          994 s
Articles processed per second    530.5          201.2
Index size                       1.04 GB        0.97 GB

Lucene 2.4.1 built the index about 2.6 times faster than Lucene 2.0.0.
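
The per-second figures and the speed ratio can be re-derived from the two build times. The sketch below assumes exactly 200,000 articles were indexed in each run (the limit at which the benchmark code stops); the class name ThroughputCheck and its methods are hypothetical helpers, not part of the benchmark.

```java
import java.util.Locale;

// Re-derives the throughput and speed-up figures from the measured build times.
// Assumption: each run indexed exactly 200,000 articles.
class ThroughputCheck {
    static double articlesPerSecond(int articles, int seconds) {
        return (double) articles / seconds;
    }

    // How many times faster the quicker run was than the slower one.
    static double speedup(int slowSeconds, int fastSeconds) {
        return (double) slowSeconds / fastSeconds;
    }

    static String oneDecimal(double value) {
        return String.format(Locale.ROOT, "%.1f", value);
    }

    public static void main(String[] args) {
        System.out.println(oneDecimal(articlesPerSecond(200000, 377))); // prints 530.5
        System.out.println(oneDecimal(articlesPerSecond(200000, 994))); // prints 201.2
        System.out.println(oneDecimal(speedup(994, 377)));              // prints 2.6
    }
}
```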

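The code used is shown below. This is the version for Lucene 2.4.1; the Lucene 2.0.0 run required adapting it to the 2.0.0 API.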
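import java.io.File;
import java.io.FileInputStream;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// FileUtils is assumed to be the Apache Commons IO class.
import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;

public class LuceneBenchmark {
    public static void main(String[] args) throws Exception {
        File dataFile = new File("C:/data/jawiki-20090124-pages-articles.xml");
        File indexDir = new File("C:/data/index");
        // Start from an empty index directory so earlier runs do not affect the result.
        if (indexDir.exists()) {
            FileUtils.deleteDirectory(indexDir);
        }

        Analyzer analyzer = new CJKAnalyzer();
        IndexWriter indexWriter = new IndexWriter(indexDir, analyzer,
                MaxFieldLength.UNLIMITED);
        // Buffer 10,000 documents in memory before flushing a segment.
        indexWriter.setMaxBufferedDocs(10000);

        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader xmlReader = factory
                .createXMLStreamReader(new FileInputStream(dataFile));

        String title = null;   // most recent <title>, paired with the next <text>
        int indexSize = 0;     // number of documents added so far

        long start = System.currentTimeMillis();

        for (; xmlReader.hasNext(); xmlReader.next()) {
            if (!xmlReader.isStartElement()) {
                continue;
            }
            String elemName = xmlReader.getName().getLocalPart();
            if (elemName.equals("title")) {
                title = xmlReader.getElementText();
            } else if (elemName.equals("text")) {
                String body = xmlReader.getElementText();

                // One Lucene document per article: stored title, compressed body.
                Document doc = new Document();
                doc.add(new Field("title", title, Store.YES, Index.ANALYZED));
                doc.add(new Field("body", body, Store.COMPRESS, Index.ANALYZED));
                indexWriter.addDocument(doc);

                indexSize++;
                // Report progress every 500 articles and stop at 200,000.
                if (indexSize % 500 == 0) {
                    System.out.println(indexSize);
                    if (indexSize >= 200000) {
                        break;
                    }
                }
            }
        }

        xmlReader.close();
        // The measured time includes closing the writer (final flush).
        indexWriter.close();

        long elapsed = System.currentTimeMillis() - start;
        System.out.println((elapsed / 1000) + " sec.");
    }
}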
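
The parsing loop in the listing above pairs each title element with the text element that follows it. The sketch below isolates that pattern using only the JDK's StAX API on a small in-memory string, so it can be tried without Lucene or the Wikipedia dump. The class name StaxPairsSketch, the method titleTextPairs, and the sample XML are hypothetical; the sample only mimics the title/text nesting and is not the real dump format in full.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: the same title/text pairing loop as the benchmark,
// collecting strings instead of adding Lucene documents.
class StaxPairsSketch {
    static List<String> titleTextPairs(String xml) {
        List<String> pairs = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            String title = null;
            for (; reader.hasNext(); reader.next()) {
                if (!reader.isStartElement()) {
                    continue;
                }
                String name = reader.getLocalName();
                if (name.equals("title")) {
                    // getElementText() leaves the reader on the matching end tag.
                    title = reader.getElementText();
                } else if (name.equals("text")) {
                    pairs.add(title + " => " + reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
        return pairs;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki>"
                + "<page><title>A</title><revision><text>alpha</text></revision></page>"
                + "<page><title>B</title><revision><text>beta</text></revision></page>"
                + "</mediawiki>";
        // prints [A => alpha, B => beta]
        System.out.println(titleTextPairs(xml));
    }
}
```

Streaming with StAX keeps only the current element in memory, which suits a multi-gigabyte dump better than loading the whole document into a DOM tree.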