Quote:
Originally posted by hartmut
I have imported several pdf-files in one UR-Database.
The pdf-files were downloaded from different sites in the net.
Now I see in the Item Attributes window that some of the files are correctly indexed and others have only 3 keywords.
The files which are not indexed are all from the same site in the net.
Does anyone here have an idea about the reason and/or a suggestion to resolve this problem?
As the files are in the same UR-Database and I did not change the option for the Import, it must be something with the pdf-files.
Thank you
Hartmut
There are apparently two kinds of pdfs--text and image. I think one pdf can have both elements. All the pdfs I have ever come across, however, have been mostly image. (When only a few words are indexed, it's because those words are the only text in the document.)
For UR to index image pdfs, it would need OCR software. I finally found a product that turns pdfs into Word documents--very imperfectly, as can be expected since we're really talking about a scanning task. It is by Nuance (formerly Scansoft) and costs about $50. I doubt it would be efficient for UR to include a component to do OCR on image pdfs. But the way to get them indexed is to turn them into something else, using software like the Scansoft product.
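
If you want to check which of your files are the image kind before importing them, a small script can count how much text each page actually contains. This is only a rough sketch: it assumes Python with the third-party pypdf package installed, the function names are made up for illustration, and the 20-words-per-page cutoff is an arbitrary guess, not anything UR itself uses.

```python
from pathlib import Path


def count_words_per_page(page_texts):
    """Return the number of whitespace-separated words on each page."""
    return [len((text or "").split()) for text in page_texts]


def classify_pdf(page_texts, min_words_per_page=20):
    """Guess whether a PDF is mostly text, mostly image, or mixed.

    min_words_per_page is an arbitrary threshold chosen for illustration.
    """
    counts = count_words_per_page(page_texts)
    if not counts:
        return "empty"
    pages_with_text = sum(1 for n in counts if n >= min_words_per_page)
    if pages_with_text == len(counts):
        return "text"
    if pages_with_text == 0:
        return "image"
    return "mixed"


def extract_page_texts(pdf_path):
    """Read the text layer of each page using pypdf (pip install pypdf)."""
    from pypdf import PdfReader  # third-party; imported here so the rest runs without it

    reader = PdfReader(str(pdf_path))
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    import sys

    # Usage: python check_pdfs.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        texts = extract_page_texts(Path(name))
        total_words = sum(count_words_per_page(texts))
        print(name, classify_pdf(texts), total_words, "words")
```

Files reported as "image" have little or no text layer, so an indexer has almost nothing to pick up from them; those are the ones that would need to go through OCR first.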