Keyword Search
Search tips:

If you type:            | This is what will happen:
apple banana            | Finds pages that contain at least one of these words.
"apple banana"          | Finds pages that contain this exact phrase.
+apple +banana          | Finds pages that contain both words.
+apple pie              | Finds pages that contain the word apple; the relevance score is higher if the page also contains pie.
+apple -pie             | Finds pages that contain the word apple but not the word pie.
+apple ~pie             | Finds pages that contain the word apple, including pages that also contain pie. Pages that contain only apple have a higher relevance rating than pages that contain both words.
+apple +(>pie <strudel) | Finds pages that contain apple and pie, or apple and strudel; pages with apple and pie have a higher relevance rating than pages with apple and strudel.
apple*                  | Finds pages that contain words beginning with apple, such as apple, apples, applesauce and applet.

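The behaviour described in the search tips can be illustrated with a small Python sketch. It is not the search engine behind this site: the function names (`words_of`, `term_matches`, `score`) and the weights (1.0 for a matched word, a 0.5 penalty for a ~ word) are assumptions chosen for illustration, and quoted phrases, parentheses and the < > operators are deliberately left out.

```python
import re

def words_of(text):
    """Lower-case a page and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_matches(term, words):
    """True if the term occurs on the page; a trailing * matches any word with that prefix."""
    term = term.lower()
    if term.endswith("*"):
        return any(w.startswith(term[:-1]) for w in words)
    return term in words

def score(query, page):
    """Return None if the page is not a hit, otherwise an illustrative relevance score."""
    words = words_of(page)
    relevance = 0.0
    matched = False
    for token in query.split():
        if token[0] in "+-~":
            op, term = token[0], token[1:]
        else:
            op, term = "", token
        found = term_matches(term, words)
        if op == "+" and not found:
            return None          # a required word is missing
        if op == "-" and found:
            return None          # an excluded word is present
        if op == "~":
            if found:
                relevance -= 0.5  # assumed penalty: lowers the rating, does not exclude
        elif found:
            relevance += 1.0      # assumed weight for a required or optional match
            matched = True
    # A page that matches no required or optional word is not returned.
    return relevance if matched else None
```

With these assumed weights, `score("+apple pie", "apple pie recipe")` is higher than `score("+apple pie", "apple tree")`, and `score("+apple -pie", "apple pie")` is `None`, which mirrors the rows above.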
The boolean full-text search capability supports the following operators:

" - A phrase enclosed in double quotes (") matches only pages that contain the phrase literally, as it was typed.
+ - A leading plus sign indicates that this word must be present in every page returned.
- - A leading minus sign indicates that this word must not be present in any page returned.
< > - These two operators change a word's contribution to the relevance value assigned to a page. The < operator decreases the contribution and the > operator increases it. See the +apple +(>pie <strudel) example in the search tips above.
( ) - Parentheses group words into subexpressions.
~ - A leading tilde acts as a negation operator, making the word's contribution to the page relevance negative. It is useful for marking noise words: a page that contains such a word is rated lower than others but is not excluded altogether, as it would be with the - operator.
* - An asterisk is the truncation operator. Unlike the other operators, it is appended to the word, not prepended.

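To make the operator syntax concrete, here is a minimal Python sketch that splits a query into tokens and records which operator, if any, is attached to each one. The `TOKEN` pattern, the `tokenize` function and the token kinds ("word", "prefix", "phrase", "group_start", "group_end") are hypothetical names invented for this illustration; the real search engine may parse queries differently.

```python
import re

# Assumed grammar: an optional leading operator, then a parenthesis,
# a quoted phrase, or a bare word (which may end in *).
TOKEN = re.compile(r'''
    (?P<open>[+\-~<>]?\()          # optional operator, then an opening parenthesis
  | (?P<close>\))                  # closing parenthesis
  | (?P<op>[+\-~<>]?)              # optional leading operator
    (?: "(?P<phrase>[^"]*)"        # quoted phrase
      | (?P<word>[^\s()"]+) )      # bare word
''', re.VERBOSE)

def tokenize(query):
    """Return a list of (kind, operator, text) tuples for a query string."""
    tokens = []
    for m in TOKEN.finditer(query):
        if m.group("open") is not None:
            tokens.append(("group_start", m.group("open")[:-1], ""))
        elif m.group("close") is not None:
            tokens.append(("group_end", "", ""))
        elif m.group("phrase") is not None:
            tokens.append(("phrase", m.group("op"), m.group("phrase")))
        else:
            word = m.group("word")
            # A trailing asterisk marks truncation (prefix matching).
            kind = "prefix" if word.endswith("*") else "word"
            tokens.append((kind, m.group("op"), word.rstrip("*")))
    return tokens
```

For example, `tokenize('+apple +(>pie <strudel)')` yields a required word, the start of a required group, a boosted word, a demoted word and the end of the group, in that order.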
Optical Character Recognition Basics
When the microfilm was scanned, the result was a digital image. An image can be manipulated as a whole, but the text within it cannot be selected, searched or edited separately. For that, the computer has to be "told" to recognise the text as text, so that it can be handled like text in a word-processing document. Optical Character Recognition (OCR) software does this: it recognises the characters in the image and makes the text searchable.
The prime measure of OCR performance, and its limitation, is accuracy. Character accuracy, the most important aspect of text recognition, varies widely with the quality and nature of the image, particularly the type and size of the fonts used and the complexity of the page layout. Generally, the better the image quality, the higher the accuracy. Accuracy is usually measured for each page during the OCR process and expressed as a percentage: 90% accuracy implies ten errors in every 100 characters. Because many of the newspaper pages that were microfilmed and subsequently scanned were of poor quality, OCR accuracy ranged from about 60% to about 85%. Higher accuracy would have required "correcting" the OCR results: after the usual OCR pass, which is done by software, the output would be proofread and corrected by humans. This was outside the budget constraints of this project.
The overall result is therefore text-searchable pages, but with less than 100% accuracy.
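The accuracy figures above can be worked through with a short Python sketch. The first two functions simply restate the percentage arithmetic. The third is an added illustration and rests on a simplifying assumption that is not from the project itself: that each character is misread independently of its neighbours. All three function names are hypothetical.

```python
def character_accuracy(total_chars, errors):
    """Accuracy as a percentage: the share of characters recognised correctly."""
    return 100.0 * (total_chars - errors) / total_chars

def expected_errors(total_chars, accuracy_percent):
    """Number of misread characters implied by a given accuracy."""
    return total_chars * (1 - accuracy_percent / 100.0)

def word_intact_probability(word_length, accuracy_percent):
    """Chance that every character of a word is correct.

    ASSUMPTION: character errors are independent, which real OCR output
    only approximates (errors tend to cluster in damaged areas).
    """
    return (accuracy_percent / 100.0) ** word_length
```

Under that independence assumption, a five-letter word on a page with 85% character accuracy would be recognised in full only about 44% of the time, and on a page with 60% accuracy only about 8% of the time. This is one way to see why a search can miss a word that is visibly present in the page image.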