Lucene in Action Reading Notes (Part 1) - 白红宇



Published: 2019-06-26


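Introduction

Lucene is a subproject of the Apache Software Foundation's Jakarta project group. It is an open-source full-text search toolkit: not a complete full-text search engine, but an architecture for building one. It provides a complete query engine and indexing engine, plus partial text-analysis engines (for two Western languages, English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it. (Excerpted from Baidu Baike)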

Code environment

Operating system: CentOS 5.8

Development environment: Eclipse 4.3

Build tool: Maven (POM modelVersion 4.0.0)

Maven configuration

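To follow the examples in the book, the Lucene version used as a dependency here is 3.0.1.

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>3.0.1</version>
</dependency>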

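Full configuration:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.linjl.study.book</groupId>
    <artifactId>book_luceneInAction</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>3.0.1</version>
        </dependency>
    </dependencies>
</project>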

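Example programs

The two examples below introduce the basics of Lucene.

Example 1: Building an index

Example 1 shows how to build an index over the .txt files under a given path.

Full source: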

package com.linjl.study.book.luceneInAction.chapter1;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

    private IndexWriter indexWriter;

    public Indexer(String indexDir) throws IOException {
        // Step 1: create the Directory
        Directory dir = FSDirectory.open(new File(indexDir));
        // Step 2: create the IndexWriter (true = create a new index, overwriting any existing one)
        indexWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws CorruptIndexException, IOException {
        // Step 5: close the IndexWriter
        indexWriter.close();
    }

    public int index(String dataDir, FileFilter fileFilter) throws IOException {
        File[] files = new File(dataDir).listFiles();
        if (files == null) {
            // listFiles() returns null when the path is not a readable directory
            throw new IOException("Cannot list files in " + dataDir);
        }
        for (File file : files) {
            if (!file.isDirectory() && !file.isHidden() && file.exists()
                    && file.canRead()
                    && (fileFilter == null || fileFilter.accept(file))) {
                indexFile(file);
            }
        }
        return indexWriter.numDocs();
    }

    private void indexFile(File file) throws IOException {
        System.out.println("Indexing " + file.getCanonicalPath());
        // Step 3: create the Document
        Document doc = getDocument(file);
        // Step 4: add the Document to the index
        indexWriter.addDocument(doc);
    }

    protected Document getDocument(File file) throws IOException {
        Document doc = new Document();
        // File contents: tokenized and indexed, but not stored
        doc.add(new Field("contents", new FileReader(file)));
        // File name and full path: stored and indexed as single, unanalyzed values
        doc.add(new Field("filename", file.getName(), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        doc.add(new Field("fullpath", file.getCanonicalPath(), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        return doc;
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File pathname) {
            return pathname.getName().toLowerCase().endsWith(".txt");
        }
    }

    public static void main(String[] strs) throws IOException {
        // Where the index is stored (a Linux path)
        String indexDir = "/opt/test/lucene/index";
        // Where the files to be indexed live (a Linux path)
        String dataDir = "/opt/test/lucene/files";
        long startTime = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (endTime - startTime) + "ms");
    }
}

Example 2: Searching an index

Example 2 shows how to run a keyword search against a given index directory.

Full source:

package com.linjl.study.book.luceneInAction.chapter1;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {

    public static void search(String indexDir, String searchWord)
            throws IOException, ParseException {
        // Step 1: create the Directory
        Directory dir = FSDirectory.open(new File(indexDir));
        // Step 2: create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(dir);
        // Step 3: create the QueryParser
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                new StandardAnalyzer(Version.LUCENE_30));
        long startTime = System.currentTimeMillis();
        // Step 4: parse the search text into a Query
        Query query = parser.parse(searchWord);
        // Step 5: run the search (the result holds only references to the matching documents)
        TopDocs hits = indexSearcher.search(query, 30);
        long endTime = System.currentTimeMillis();
        System.out.println("Found " + hits.totalHits + " document(s) (in "
                + (endTime - startTime) + "ms) that matched query '"
                + searchWord + "':");
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            // Step 6: load the actual document for each reference
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println(doc.get("fullpath"));
        }
        // Step 7: close the IndexSearcher
        indexSearcher.close();
    }

    public static void main(String[] args) throws IOException, ParseException {
        String indexDir = "/opt/test/lucene/index";
        String searchWord = "床";
        Searcher.search(indexDir, searchWord);
    }
}

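Core classes of the indexing process

IndexWriter
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, deletes, or updates documents in the index. It gives write access only: it cannot read or search the index.

Directory
Directory describes where a Lucene index is stored. It is an abstract class with many subclasses. The FSDirectory used in the examples stores the index in the file system; other subclasses include memory-based ones such as RAMDirectory.

Analyzer
Before text is indexed, it must be processed by an Analyzer. The analyzer extracts tokens from the text to be indexed and discards the rest as noise. Analyzer is an abstract class, and Lucene ships several implementations, but they do not handle Chinese word segmentation well. Several good open-source Chinese segmentation libraries are available online.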
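
To make "extract tokens and discard the rest" concrete, here is a minimal plain-Java sketch. It is a toy stand-in, not Lucene's Analyzer API: the class name `ToyAnalyzer`, its `analyze` method, and the stop-word list are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer: NOT Lucene's Analyzer API.
class ToyAnalyzer {
    // Hypothetical stop-word list, chosen only for this illustration.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "of", "is"));

    // Splits text into lowercase tokens and drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of Lucene is Indexing!"));
    }
}
```

Splitting on non-letter runs works for space-delimited languages but leaves a run of Chinese characters as one long token, which is one way to see why Chinese text needs a dedicated segmentation step.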

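Document
A Document object represents a collection of fields (Field). You can think of a Document as a virtual document, such as a web page, an e-mail message, or a text file, from which you can later retrieve data.

Field
Every document in an index contains one or more named fields, each represented by the Field class. Each field has a name and a corresponding value, plus a set of options that control precisely how Lucene indexes that value. A document may have more than one field with the same name. In that case, the values are appended in the order in which they were added during indexing. At search time, the text of all those fields behaves as if it were concatenated into a single text field.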
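
The behavior of repeated field names can be pictured with a small plain-Java model. This is only a sketch under stated assumptions, not Lucene's Document or Field classes: `ToyDocument`, `getValues`, and `searchableText` are hypothetical names, and joining values with a single space is a simplification of what Lucene does internally.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a document with repeatable field names: NOT Lucene's Document class.
class ToyDocument {
    // Each entry is a {name, value} pair, kept in the order it was added.
    private final List<String[]> fields = new ArrayList<String[]>();

    ToyDocument add(String name, String value) {
        fields.add(new String[] {name, value});
        return this;
    }

    // Returns every value stored under the given name, in insertion order.
    List<String> getValues(String name) {
        List<String> values = new ArrayList<String>();
        for (String[] field : fields) {
            if (field[0].equals(name)) {
                values.add(field[1]);
            }
        }
        return values;
    }

    // The text a search over this field would effectively see: all values joined.
    String searchableText(String name) {
        StringBuilder sb = new StringBuilder();
        for (String value : getValues(name)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument()
                .add("author", "Erik")
                .add("title", "Lucene in Action")
                .add("author", "Otis");
        System.out.println(doc.getValues("author"));
        System.out.println(doc.searchableText("author"));
    }
}
```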

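Core classes of the search process

IndexSearcher
IndexSearcher searches the index that IndexWriter builds. The class exposes several search methods and is the central link to the index; you can think of it as a class that opens an index in read-only mode. It needs a Directory instance that holds a previously created index, and then offers a number of search methods. Its simplest usage looks like this: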
```
Directory dir = FSDirectory.open(new File("/tmp/index"));
IndexSearcher searcher = new IndexSearcher(dir);
Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);
searcher.close();
```

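Term
A Term is the basic unit of search. Like a Field object, a Term consists of a pair of strings: the name of the field and the word (the text value) in that field.

Query
Lucene provides many concrete Query subclasses.

TermQuery
TermQuery is the most basic query type Lucene provides, and one of the primitive query types. It matches documents that contain a field with a specific value.

TopDocs
TopDocs is a simple container of pointers to the top N ranked search results, that is, the documents that match a query. For each of the top N results, TopDocs records the int docID (which you can use to retrieve the document) and the float score.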
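
How a term lookup turns into a ranked list of (docID, score) pairs can be sketched with a toy inverted index in plain Java. This is an illustration only, not Lucene's data structures: `ToyIndex` and `Hit` are hypothetical names, and scoring by raw term frequency is an assumption made to keep the sketch short (Lucene's real scoring formula is richer).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy inverted index: an illustration only, NOT Lucene's data structures or scoring.
class ToyIndex {
    // One search hit: a document ID plus a score, similar in spirit to a ScoreDoc.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Maps a term, written as "field:text", to (docId -> occurrences in that document).
    private final Map<String, Map<Integer, Integer>> postings =
            new HashMap<String, Map<Integer, Integer>>();

    // Records the tokens of one field for one document.
    void add(int docId, String field, String... tokens) {
        for (String token : tokens) {
            String term = field + ":" + token;
            Map<Integer, Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new HashMap<Integer, Integer>();
                postings.put(term, docs);
            }
            Integer count = docs.get(docId);
            docs.put(docId, count == null ? 1 : count + 1);
        }
    }

    // Looks up a single term and returns at most n hits, best score first.
    List<Hit> search(String field, String text, int n) {
        List<Hit> hits = new ArrayList<Hit>();
        Map<Integer, Integer> docs = postings.get(field + ":" + text);
        if (docs != null) {
            for (Map.Entry<Integer, Integer> e : docs.entrySet()) {
                // Assumed scoring: raw term frequency.
                hits.add(new Hit(e.getKey().intValue(), e.getValue().floatValue()));
            }
        }
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                int byScore = Float.compare(b.score, a.score);
                return byScore != 0 ? byScore : Integer.compare(a.docId, b.docId);
            }
        });
        return hits.size() > n ? new ArrayList<Hit>(hits.subList(0, n)) : hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(0, "contents", "lucene", "search");
        index.add(1, "contents", "lucene", "lucene", "index");
        for (Hit hit : index.search("contents", "lucene", 10)) {
            System.out.println("doc=" + hit.docId + " score=" + hit.score);
        }
    }
}
```

Note that the hits carry only document IDs and scores; as in the Searcher example, the document itself is loaded in a separate step.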

Summary

This post covers Chapter 1 of Lucene in Action. Its two examples give a first look at what Lucene is and how to use it.

(End. 2013-09-04, Shenzhen)

Reposted from: https://my.oschina.net/linjunlong/blog/159000
