  #3  
03-27-2006, 02:50 PM
srdiamond
Registered User
 
Join Date: 11-23-2004
Location: Los Angeles
Posts: 126
Re: pdf import not indexed

Quote:
Originally posted by hartmut
I have imported several pdf-files in one UR-Database.
The pdf- files were downloaded from different sites in the net.
Now I see in the Item Attributes window that some of the files are correctly indexed and others have only 3 keywords.
The files which are not indexed are all from the same site in the net.
Does anyone here have an idea about the reason and/or a suggestion to resolve this problem?
As the files are in the same UR-Database and I did not change the import options, it must be something with the pdf-files.
Thank you


Hartmut
There are apparently two kinds of pdfs--text and image. I think one pdf can have both elements. All the pdfs I have ever come across, however, have been mostly image. (When only a few words are indexed, it's because those words are the only actual text in the document.)
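
For anyone who wants a rough way to tell the two kinds apart before importing, here is a minimal sketch in Python. It is only a heuristic, not how UR decides anything: it assumes that a pdf with real text declares at least one /Font resource, and that a pure scan declares none. The function names are made up for illustration. Newer pdfs can pack their objects into compressed streams, so a result of "image only" is a hint, not proof.

```python
import re


def count_font_markers(pdf_bytes: bytes) -> int:
    """Count '/Font' name tokens visible in the raw bytes of a pdf."""
    # Match /Font itself, but not longer names such as /FontDescriptor.
    return len(re.findall(rb"/Font(?![A-Za-z0-9])", pdf_bytes))


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic guess: no font resources suggests a scanned, image-only pdf."""
    return count_font_markers(pdf_bytes) == 0
```

To try it on a file, read it in binary mode, e.g. `looks_image_only(open("some.pdf", "rb").read())`, where "some.pdf" stands for whichever file you want to check.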


For UR to index image pdfs, it would need OCR software. I finally found a product that turns pdfs into Word documents--very imperfectly, as can be expected, since we're really talking about a scanning task. It is by Nuance (formerly Scansoft) and costs about $50. I doubt it would be efficient for UR to include a component to do OCR on image pdfs. But the way to get them indexed is to turn them into something else, using software like the Scansoft product.