How to Index & Search a PDF in Java With Lucene

Written by Steve McDonnell

Apache Lucene is a full-featured text search engine library written in Java. You can use Lucene to index and search any kind of text document. To convert a Portable Document Format (PDF) file into text that Lucene can index, you can use the open source PDFBox library, whose "LucenePDFDocument" class is designed specifically for Lucene. Provide the PDF file to PDFBox and you get back a Lucene Document object that can be added to the index and searched just like any text file. The code in this article uses the older Lucene 2.x API; later Lucene releases removed the "Hits" and "Hit" classes.

Skill level:
Easy

Instructions

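  1. Select a Lucene analyzer to use when creating the index, for example "StandardAnalyzer". Then create an "IndexWriter" object to handle adding new items to the index. The third argument, "true", tells Lucene to create a new index in the "index" directory, replacing any index that already exists there. For example:

    IndexWriter myWriter = new IndexWriter("index", new StandardAnalyzer(), true);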

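  2. Call "LucenePDFDocument.getDocument()" to get a Lucene Document object for your PDF file. Add any other key fields to the object, add the object to the Lucene index, and close the writer when you have finished so that the changes are saved. In this example, "filename" is the path to the PDF, and "pdf" is assumed to be an object that exposes the PDF's metadata, such as a PDFBox "PDDocumentInformation" instance. For example:

    Document pdfDoc = LucenePDFDocument.getDocument(new File(filename));

    pdfDoc.add(new Field("title", pdf.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));

    pdfDoc.add(new Field("author", pdf.getAuthor(), Field.Store.YES, Field.Index.TOKENIZED));

    myWriter.addDocument(pdfDoc);

    myWriter.close();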

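  3. Use a search helper class to search the Lucene index. The "SearchEngine" class in this example is not part of Lucene itself; it is assumed to be an application class that wraps Lucene's "IndexSearcher" and "QueryParser". Its "performSearch()" method returns a Lucene "Hits" object with a list of "Hit" objects. For example:

    SearchEngine mySearch = new SearchEngine();

    Hits myHits = mySearch.performSearch(searchText);

    System.out.println("Documents matched: " + myHits.length());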

  4. Iterate through the "Hit" objects to get more information about each match. The "Hit" objects are ordered by relevance to the search, and you can obtain each match's relative score with "getScore()". For example:

    Iterator<Hit> itr = myHits.iterator();

    while (itr.hasNext()) {

        Hit theHit = itr.next();

        Document theDoc = theHit.getDocument();

        System.out.println(theDoc.get("title") + " - " + theHit.getScore());

    }
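
To make the indexing and searching steps above more concrete, here is a minimal plain-Java sketch of the underlying idea: an inverted index that maps each term to the documents containing it. It is only an illustration, not how Lucene or PDFBox is implemented. The class name "ToyIndex", its methods, and the ranking by simple term counts are all hypothetical, and Lucene's analyzers and scoring are considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: a hypothetical in-memory index, not Lucene's implementation.
class ToyIndex {
    // term -> (document title -> number of times the term occurs in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Stand-in for an analyzer: lowercase the text and split on non-alphanumeric characters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for adding a document to the index: count each token under the document's title.
    void addDocument(String title, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new HashMap<>())
                    .merge(title, 1, Integer::sum);
        }
    }

    // Stand-in for a search: sum the counts of every query term per document,
    // then return the matching titles with the highest total first (ties sorted by title).
    List<String> search(String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (String term : analyze(query)) {
            Map<String, Integer> counts = postings.getOrDefault(term, new HashMap<>());
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                scores.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        List<String> titles = new ArrayList<>(scores.keySet());
        titles.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return titles;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("Guide", "Lucene indexes text. Lucene searches text.");
        index.addDocument("Notes", "PDFBox extracts text for Lucene.");
        // Prints the titles of documents that mention "lucene", best match first.
        for (String title : index.search("lucene")) {
            System.out.println(title);
        }
    }
}
```

In the real workflow, PDFBox supplies the text extracted from the PDF, the analyzer chosen in step 1 plays the role of "analyze", and the "IndexWriter" and the search classes replace "addDocument" and "search".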
