Genealogy, like most other interests, has certain guidelines which
should be followed. Women are recorded under their maiden names.
Dates are recorded in a set format. Likewise, researchers are
expected to cite the sources used in compiling their family history.
All too often in the research phase, new information is discovered in
a published volume. When the researcher tries to determine where the
compiler found the information, frustration mounts as the researcher
discovers that no sources were cited. Such a volume is automatically
considered suspect and the job of recreating the research begins.
Similar scenarios take place every day as researchers surf the
Internet in search of family history pages. For some reason, online
researchers often do not apply the same rule of thumb to published
family history web pages as they would to a published book. It is just
as important, nay, more important, to see the sources used in compiling a
family history web page. If sources are not provided, one should try to
determine where the data came from and how it was digitized for
inclusion on the Internet.
Where Did It Come From and How Did It Get to the Internet?
While most genealogical researchers think that compiled family
history Internet pages are where the most information will be found,
more time is actually spent using transcription sites or compiled
databases. These differ from a compiled family history page in the
manner of presentation. Compiled family history pages are similar to the
narrative style of the Register or the National Genealogical
Society Quarterly style of family history reports found in books and
periodicals. Many other websites offer transcribed data from census
records, probate records, indexes to vital records, and so forth.
Determining how that information has been digitized is the key to
effective searching online and separating fact from fiction.
Generally, information materializes online in one of three ways: manual
entry, optical character recognition (OCR), or digitized images.
Each method offers both pros and cons pertaining to accuracy and ease
of use.
Manual Entry
Many volunteer organizations rely on this method to add information
to the Internet. Some groups have many volunteers, while others may have
just a handful. These busy individuals will take information found
either on microfilm or photocopies and transcribe or abstract it,
creating a file that can then be uploaded to the Internet.
As with any transcription or abstraction project, the human element
leaves room for error. The records are at least one more generation
removed from the original documents, and each generation removed offers
that many more chances for errors to creep in. Some of these errors may
result in misspelled names or transposed dates. These errors
are not intentional, just the result of human frailty.
Depending on the project, it is possible that some of the errors will
be caught and corrected during the editing phase. However, even with
the best of editors, some errors will be overlooked, and thus make it
into the final file as it is displayed on the Internet.
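One of the slips described above, the transposed date, is easy to test for when a transcription is compared with the original record. The following is a minimal sketch in Python, not a method used by any particular project; the function name is hypothetical. It reports whether two values differ only by a swap of two adjacent characters.

```python
def is_adjacent_transposition(original, transcribed):
    """Return True if the strings differ only by one swap of adjacent characters."""
    if len(original) != len(transcribed) or original == transcribed:
        return False
    # Positions where the two strings disagree.
    diffs = [i for i, (a, b) in enumerate(zip(original, transcribed)) if a != b]
    if len(diffs) != 2 or diffs[1] - diffs[0] != 1:
        return False
    i, j = diffs
    # The two mismatched characters must be each other's mirror image.
    return original[i] == transcribed[j] and original[j] == transcribed[i]


print(is_adjacent_transposition("1867", "1876"))  # True
print(is_adjacent_transposition("1867", "1967"))  # False
```

A check like this only helps once the original record is in hand; it cannot reveal an error in a transcription viewed on its own.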
The best projects will offer full source citation, allowing the
researcher to return to the original documents during the verification
aspect of the research process. Even with the most reliable of sites, it
is a good idea to verify the information with the original records, if
available. Transcription and abstraction sites are good index resources
that guide a researcher to the exact location in the original records.
Thus, their importance is no less, as they play a key role in the
research process.
Most researchers today are sandwiching their family history research
in amongst all the other duties required of them in a given week.
Transcription and abstraction sites allow these researchers to make
progress even when they cannot access the original records right away.
Optical Character Recognition (OCR)
Companies in the business of creating digitized files of previously
published works use Optical Character Recognition programs. OCR software
takes a typed page and converts the scanned image to a text file.
Basically, OCR software is reading the page to the computer. Such a file
can then be worked on in a word processing program, added to a
database, or made available on the Internet either as a web page or in a
searchable database. OCR software moves faster than the fastest typist,
allowing more pages to be digitized in a shorter period.
Given the crisp output of today's laser printers and high-quality inkjet
printers, it can be difficult to see where a problem could lie with this
process. However, most of the records that are run through OCR software
via scanners are not that clear. Many of them have smudged ink or old
printing press type. Ink splotches can throw off the OCR software. Even
clear pages can be misread if the document is not held firmly on a
flatbed scanner or is not fed evenly through a sheet-fed scanner.
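Because OCR misreadings tend to follow the shapes of the letters, a researcher searching an OCR-generated database can try the spellings a misread would produce. The sketch below is illustrative only: the confusion pairs are assumptions chosen for the example, not a list taken from any OCR product, and the function name is hypothetical.

```python
# Assumed confusion pairs: (what the page says, what OCR might output).
OCR_CONFUSIONS = [("m", "rn"), ("l", "1"), ("O", "0"), ("S", "5")]


def ocr_variants(name):
    """Return the spellings that a single assumed OCR confusion could turn `name` into."""
    variants = set()
    for printed, misread in OCR_CONFUSIONS:
        start = name.find(printed)
        while start != -1:
            # Replace just this one occurrence of the printed character(s).
            variants.add(name[:start] + misread + name[start + len(printed):])
            start = name.find(printed, start + 1)
    return variants


print(sorted(ocr_variants("Holmes")))  # ['Ho1mes', 'Holrnes']
```

Searching for each variant in turn can surface entries that an exact search on the correct spelling would miss.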
Companies rely on OCR software to process more pages than would be
possible through the work of typists. While most companies have a
quality assurance department responsible for checking the overall
quality of the resulting digital files, errors can still creep in. More
information can be processed in this manner than with manual entry, but
there is still a limit to the number of pages a human can compare with
the originals.
As with manual entry, companies offering OCR-generated files should
include source citations that allow the researcher to take information
found in the online database and compare it to the original record.
There will be times when a researcher may not be able to get the
original record. In such a case, it is necessary for the researcher to
be open to the possibility that the information found in the online
database may be in error.
Digitized Images
A recent move in both online and CD-ROM products is the digitized
image. This process uses a scanner just as the OCR process does. The
difference is that instead of trying to convert the image to a text
file, the image is stored as a graphical file. The benefit of this
process is that companies (most often the ones able to afford to
digitize) are not limited to typed or published resources. Census
records and other handwritten documents can be included
through this process. Another benefit to this method is that instead of
having to guess if errors have inadvertently been added to the file, the
researcher can see the original document as though working with a
microfilm on a microfilm reader.
Two drawbacks to this method are the size of the files, which results
in increased download time for the researcher, and the limitations in
indexing. Digitized image files are much larger than text files. While a
text file may load in sections, allowing the researcher to read as the
rest is loading, images often must load completely before they are
readable. Companies may still have to manually create the index, thus
resulting in the same error problems mentioned earlier. However, a
researcher can generally scroll through the digitized image pages
searching for an entry if that individual was not found in the manually
created, and thus searchable, index.
This process is now used for books as well as handwritten documents.
Nearly the entire run of the New England Historical and Genealogical
Register has been added to this website using the digitized graphics
method. The researcher looks at the pages just as though picking up a
volume of the Register off the shelf at the library. The only
errors included in such digitized images are those that were included in
the original resource, whether it be a book, periodical, or census
record.
Telling the Difference
There is virtually no way to tell the difference between data entered
manually and data fed through an OCR program, whether online or on
CD-ROM. However, most volunteer groups recognize the efforts of those
who have spent the hours necessary in transcribing or abstracting the
information made available in their databases. The USGenWeb sites are prime examples of
work done using the manual entry method.
Telling the difference between manually entered or OCR-generated text
and digitized images is easy. Digitized images look just like a
photocopy of the original, complete with smudges and irregularities in
the type or printing process. Manually entered or OCR-created files will
look like typed text, usually in one of the standard fonts like Times
New Roman or Helvetica. There will also not be a page edge, as is often
seen with the digitized image files. Of course, handwritten records are
easy to identify, as they are always digitized images.
However, the index used to search such records was created using
manual entry. The same problems inherent with any index are likely to be
present in this type of index. When researching a family name, the
researcher is looking through each page of a handwritten document for a
specific name to leap off the page. Indexers, on the other hand, must
try to evaluate the handwriting to determine what surname or given name
was actually written on the page.
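Since an indexer's reading of the handwriting may differ from the spelling a researcher expects, it can help to look for near matches rather than exact ones. The sketch below uses `difflib.get_close_matches` from the Python standard library; the index entries, the function name, and the 0.7 cutoff are made up for illustration.

```python
import difflib

# Hypothetical index entries, as an indexer might have read them.
index_entries = ["Smyth", "Smith", "Schmidt", "Jones", "Johnson", "Smithe"]


def near_matches(surname, entries, cutoff=0.7):
    """Return index entries spelled similarly to `surname`, best match first."""
    return difflib.get_close_matches(
        surname, entries, n=max(1, len(entries)), cutoff=cutoff
    )


print(near_matches("Smith", index_entries))
```

Lowering the cutoff widens the net at the cost of more unrelated names, so the value is a judgment call rather than a fixed rule.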
Another way a researcher can distinguish between these different
methods relates to how effective the index is. When the researcher types
in a name, if the search takes the researcher to the exact entry,
highlighting it, then the information was entered either manually or
through OCR. In a digitized image file, the closest an index can take a
researcher is the page where the entry can be found, requiring the
researcher to then scan the page looking for the name of the desired
individual.
Compiled Family Histories
Compiled family histories on the Internet take on many different
forms. There are the narrative style reports generated from the many
different genealogy software programs on the market today. There are
also compiled database sites made up of GEDCOM files. Both types of
sites are the result of research by fellow researchers. Their level of
knowledge and experience in researching a family tree will range from
novice to professional. Generally, there is no way to tell the level of
experience unless the submitter of the GEDCOM file or the creator of the
web page has cited sources.
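A GEDCOM file is plain text in which each line carries a level number, an optional cross-reference ID, and a tag, with source citations marked by the `SOUR` tag. That makes it possible to check a downloaded file for individuals with no citation at all. The following is a rough sketch, assuming a well-formed file; the sample data and the function name are invented for the example, and this is not a full GEDCOM parser.

```python
def individuals_without_sources(gedcom_text):
    """Return the IDs of INDI records that contain no SOUR line at any level."""
    sourced = {}    # record ID -> True once a SOUR tag is seen inside the record
    current = None  # ID of the INDI record being read, if any
    for raw in gedcom_text.splitlines():
        parts = raw.strip().split(maxsplit=2)
        if len(parts) < 2:
            continue
        if parts[0] == "0":
            # A level-0 line starts a new record, e.g. "0 @I1@ INDI".
            if len(parts) == 3 and parts[2] == "INDI":
                current = parts[1]
                sourced[current] = False
            else:
                current = None
        elif current is not None and parts[1] == "SOUR":
            sourced[current] = True
    return [xref for xref, has_source in sourced.items() if not has_source]


# Invented sample data for illustration only.
SAMPLE = """\
0 HEAD
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 12 MAR 1850
2 SOUR @S1@
0 @I2@ INDI
1 NAME Mary /Jones/
1 BIRT
2 DATE ABT 1852
0 @S1@ SOUR
1 TITL 1850 U.S. Federal Census
0 TRLR
"""

print(individuals_without_sources(SAMPLE))  # ['@I2@']
```

The presence of a `SOUR` line shows only that something was cited, not that the citation is sound, so the records that pass this check still call for verification.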
The biggest problem with compiled family histories, regardless of the
final format made available online, is the origin of the research.
Unfortunately, there is a disturbing trend where researchers are
downloading GEDCOM files from sites such as RootsWeb's WorldConnect or the Family History Library's
Ancestral File or Pedigree Resource File and simply incorporating that
data into their family history database. They then export the
unverified information as a GEDCOM file or a narrative genealogy
report and put it back out onto the Internet. The result of this process
is the propagation of misinformation.
Because of its easy access on the Internet, this misinformation
spreads more quickly than it would through published books. It is easier
to find with the myriad search engines available, and the scenario
becomes an almost never-ending cycle as researchers download and then
upload the same data.
The Internet is a resource that offers researchers the ability to
make progress with their family history research despite the limitations
of the operational hours of libraries. Unfortunately, many researchers
new to this hobby are limiting their research to just the Internet.
While computers are marvelous tools, most of the information found
online has been manipulated in some way by humans and generally should
be verified with original, primary documents whenever possible. As
researchers, it is important to demand source citations regardless of
the individual or company making the data available. Without the source
citation, the information is nearly impossible to verify and it defeats
the purpose of sharing it, regardless of the final output used online.
Until information found online has been verified, whether it was
downloaded as a GEDCOM file or read on a family history web site, the
researcher should keep it separate from a personal database. It is much
easier to add
information to a database later, after verifying its accuracy, than it
is to try to delete individuals who later turn out not to be related.
It is important to remember that a source is any record, index,
CD-ROM, or web site where information was found. Researchers are always
told to cite sources, but unfortunately, few are diligent about it. With
the Internet, it is all the more important that the information a
researcher finds be traceable to its source. No sources? The information is
suspect and should be quarantined to a separate database until verified.
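The quarantine idea can be sketched in a few lines of Python. The record layout below (dictionaries with a `source` field) and the function name are assumptions made for the example, not the format of any genealogy program.

```python
def split_by_citation(records):
    """Separate records into (cited, quarantined) by whether a source is recorded."""
    cited, quarantined = [], []
    for record in records:
        # Treat a missing, empty, or whitespace-only source as no citation.
        source = (record.get("source") or "").strip()
        if source:
            cited.append(record)
        else:
            quarantined.append(record)
    return cited, quarantined


# Invented sample records for illustration only.
downloaded = [
    {"name": "John Smith", "birth": "12 Mar 1850", "source": "1850 census, page image viewed"},
    {"name": "Mary Jones", "birth": "abt 1852", "source": ""},
    {"name": "Ann Smith", "birth": "1875"},
]

cited, quarantined = split_by_citation(downloaded)
print([r["name"] for r in quarantined])  # ['Mary Jones', 'Ann Smith']
```

Records in the cited group are candidates for verification against the named source; they are not verified merely by having one.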
Even when sources are cited, they are often simply another person's
GEDCOM file or web page. No further research has been undertaken before
making the information available on yet another web page or through
another GEDCOM file. As a result, there is a lot of fiction currently
found out on the Internet under the guise of family history fact.