How do you manage your project? Data Conversion and Digitization

Boboy

We’ll be working on a project in a government learning institution, and I was hired to manage it. There are large volumes of material to be converted and digitized into text: it will be scanned and run through OCR (optical character recognition). The resulting text must then be turned into useful information, sorted, and perhaps made part of a database. From what I've heard, the source is physical research material (master's theses and PhD dissertations), which will be OCRed into text. Does anyone have experience with this? Can you share your journey or give some tips for us newcomers to the field?
 
QAMTY

Hi, look for a company that has scanners and good OCR software; they usually have specialized software for this task.
I hired a company that scanned about 100,000 documents, including books, manuals, letters, brochures, etc.
They were OCRed and converted to PDF format.
Two important issues. First, define the appropriate resolution: a very crisp document scanned at 3000 dpi may be 15 MB, while the same document scanned at 200 dpi may be 300 KB, and most of the time 300 dpi is OK.
Keep in mind that heavy PDFs take longer to open and read.
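To make the resolution trade-off concrete, here is a minimal Python sketch of how raw image size grows with dpi. It assumes an A4 page in 24-bit colour with no compression, none of which comes from this thread, and the function name raw_page_bytes is hypothetical. Real PDF sizes depend heavily on compression and will be much smaller, so treat the numbers only as a way to compare resolutions.

```python
# Rough estimate of the uncompressed size of one scanned page.
# Assumptions (not from the thread): A4 paper, 24-bit colour, no compression.

A4_WIDTH_IN = 8.27    # A4 width in inches
A4_HEIGHT_IN = 11.69  # A4 height in inches


def raw_page_bytes(dpi: int, bits_per_pixel: int = 24) -> int:
    """Return the uncompressed size in bytes of one A4 page scanned at `dpi`."""
    width_px = round(A4_WIDTH_IN * dpi)
    height_px = round(A4_HEIGHT_IN * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    for dpi in (200, 300, 600):
        megabytes = raw_page_bytes(dpi) / 1_000_000
        print(f"{dpi} dpi: about {megabytes:.1f} MB uncompressed")
```

Because the pixel count grows with the square of the resolution, going from 200 to 300 dpi multiplies the raw size by about 2.25, and doubling the dpi multiplies it by about 4.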
The other point: to organize the data for easy searching, it is important to define a good structure.
For example, it could be: manuals, type, region, speciality, other; pictures, type, region, zone; etc.
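One possible way to express such a structure, sketched in Python. The class DocumentRecord, its field names, and the slug helper are hypothetical; they only illustrate turning a classification like "manuals, type, region" into a predictable folder path, and a real project would use whatever fields its users agree on.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


def slug(text: str) -> str:
    """Lower-case the text and join its alphanumeric words with '-'."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return "-".join(cleaned.split())


@dataclass(frozen=True)
class DocumentRecord:
    """One scanned item, classified by hypothetical fields."""
    category: str   # e.g. "Manuals" or "Pictures"
    doc_type: str   # e.g. "Maintenance"
    region: str     # e.g. "North"
    title: str

    def storage_path(self) -> PurePosixPath:
        """Build a folder path that mirrors the classification."""
        folder = PurePosixPath(slug(self.category), slug(self.doc_type), slug(self.region))
        return folder / f"{slug(self.title)}.pdf"


if __name__ == "__main__":
    record = DocumentRecord("Manuals", "Maintenance", "North", "Pump Model X")
    print(record.storage_path())
```

Deriving the path from the record keeps the folder layout consistent, so documents scanned later land in the same structure as the first batch.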
Normally they sell special software for managing the data; it is better than browsing documents with only Windows Explorer and a PDF reader.
Another issue: how will you OCR and scan additional documents once the company has left? Remember that they should be scanned in the same way as the previous documents. In addition, there is document management software on the market that can manage the PDFs as well.

Regards
 
Boboy

Thank you, QAMTY, for your reply. Very useful. I am on the right track by being proactive. I will prepare some questions and request that they be answered, because I want to do my best for the company. For my part, no research is wasted. Who knows, I might stumble on something too. I’ll give more info once my questions are answered. By the way, I would appreciate it if anyone could suggest the “right” questions to ask, something like a gap analysis, but fit for this project: a migration from paper-based to paper-free. Again, thanks a lot.
 

howste

Thaumaturge
Trusted Information Resource
I don't claim to be an expert but I've done a fair amount of scanning and OCR over the years. If you're converting to text then you need to scan at the best resolution for OCR. Searches in a database could be useless if there are key words that aren't recognized correctly. If you're also planning to save the images to PDF and space is a factor, high resolution images can always be reduced after the OCR for better storage.
 
QAMTY

For searching, most software uses metadata: key data attached to every document that helps the SW do the search. It doesn't use the OCRed text.
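A minimal sketch of what metadata-based search can look like, in Python. The MetadataIndex class and the field names in the example ("type", "department") are hypothetical and do not describe any particular product; the point is only that the lookup runs on fields attached to each document, not on the OCRed text.

```python
from __future__ import annotations

from collections import defaultdict


class MetadataIndex:
    """Tiny in-memory index from (field, value) pairs to document ids."""

    def __init__(self) -> None:
        self._index: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add(self, doc_id: str, metadata: dict[str, str]) -> None:
        """Register a document under each of its metadata fields."""
        for field, value in metadata.items():
            self._index[(field, value.lower())].add(doc_id)

    def find(self, **criteria: str) -> set[str]:
        """Return ids of documents that match every field=value criterion."""
        if not criteria:
            return set()
        matches = [self._index.get((field, value.lower()), set())
                   for field, value in criteria.items()]
        return set.intersection(*matches)


if __name__ == "__main__":
    index = MetadataIndex()
    index.add("thesis-001", {"type": "thesis", "department": "Physics"})
    index.add("thesis-002", {"type": "thesis", "department": "Biology"})
    print(index.find(type="thesis", department="physics"))
```

A search like this is only as good as the metadata entered for each document, so the fields need to be agreed on before scanning starts.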
Regards
 
Boboy

Hi Howste. Thanks for sharing your experience.

While formulating some questions, I noticed some confusing terminology that should be defined, because books and physical research materials (theses and dissertations) can easily be interpreted as “documents”. I might call them product documents, and I might call the other type management documents. Please suggest better or more appropriate terms.

My examples....

Management Documents: written procedures, manuals, blank forms, current memos, notices, etc

Records: filled-out forms, archived procedures, memos, notices

Product Document: book, physical research (master's and PhD)

Appreciate your help.
 

howste

Thaumaturge
Trusted Information Resource
I don't see a problem with the terms you've used, as long as they work for the intended users of the system you're setting up. Alternatives to your Management Documents category that I've heard are "Policies and Procedures," "Instruction Documents" or "Command Media." Product Documents could be "Publications," "Scientific & Technical Information," or "Research Results." Overall I don't think the terms you use are important as long as they adequately describe what they are so people can find what they need.
 
QAMTY

Take into consideration that some documents may be handwritten, so the OCRed output may not be fully legible.

In fact, take into consideration that you may have problems in the conversion regarding the reliability of the text recognition.


Before making a decision, prepare a comparison table of suppliers. Take some "difficult documents" (handwritten pages, pictures, high-density text pages, etc.), ask the suppliers to run tests on them, and review the results; you may notice important differences among them.
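One way to put numbers on such a test is the character error rate: the edit distance between a manually typed reference text and the OCR output, divided by the length of the reference. The Python sketch below is only an illustration; the function names and the supplier labels in the example are hypothetical, and it assumes you have typed up a correct reference for each test page.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by reference length; 0.0 means a perfect match."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


def rank_suppliers(reference: str, outputs: dict) -> list:
    """Return (supplier, error rate) pairs, lowest error rate first."""
    rates = {name: character_error_rate(reference, text)
             for name, text in outputs.items()}
    return sorted(rates.items(), key=lambda item: item[1])


if __name__ == "__main__":
    reference = "Thesis on soil erosion"
    outputs = {"Supplier A": "Thesis on soil erosion",
               "Supplier B": "Thes1s on soi1 erosion"}
    for name, rate in rank_suppliers(reference, outputs):
        print(f"{name}: {rate:.1%} character errors")
```

Running the same difficult pages through every supplier and comparing the rates gives one column for the comparison table.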

Important:
-How quickly you can open a heavy document
-Ease of navigation
-Organization structure
-Features of the SW to be used, for scanning and also for browsing
-Cost of SW, upgrades, etc.
-SW manufacturer
-SW support
-SW latest technologies
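The criteria above could feed a simple weighted scoring table. The Python sketch below is one possible approach, not something described in this thread: the weights, the 1 to 5 scores, and the supplier names are made-up illustration values that you would replace with your own.

```python
# Weighted scoring of suppliers against a list of criteria.
# All weights and scores here are hypothetical illustration values.

WEIGHTS = {
    "opening_speed": 3,
    "ease_of_navigation": 3,
    "organization_structure": 2,
    "features": 2,
    "cost": 3,
    "support": 2,
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of the scores; every weighted criterion must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def rank(suppliers: dict, weights: dict) -> list:
    """Return (supplier, score) pairs, best score first."""
    scored = [(name, weighted_score(scores, weights))
              for name, scores in suppliers.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    suppliers = {
        "Supplier A": {"opening_speed": 4, "ease_of_navigation": 3,
                       "organization_structure": 4, "features": 3,
                       "cost": 2, "support": 4},
        "Supplier B": {"opening_speed": 3, "ease_of_navigation": 4,
                       "organization_structure": 3, "features": 4,
                       "cost": 4, "support": 3},
    }
    for name, score in rank(suppliers, WEIGHTS):
        print(f"{name}: {score:.2f}")
```

Agreeing on the weights before looking at the test results helps keep the comparison fair.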


Regards
 
Boboy

Please excuse my ignorance, but what is SW?

QAMTY, I want you to know how much I value your support.
 