American Ancestors New England Historic Genealogical Society - Founded 1845
  • Genealogy and Technology: Researching Online: Separating Fact from Fiction

    Rhonda R. McClure

    Genealogy, like most other interests, has certain guidelines that should be followed. Women are recorded under their maiden names. Dates are recorded in a set format. Likewise, researchers are expected to cite the sources used in compiling their family history information.

    All too often in the research phase, new information is discovered in a published volume. When the researcher tries to determine where the compiler found the information, frustration mounts as the researcher discovers that no sources were cited. Such a volume is automatically considered suspect and the job of recreating the research begins.

    Similar scenarios take place every day as researchers surf the Internet in search of family history pages. For some reason, online researchers often do not apply the same rule of thumb to published family history web pages as they would to a published book. It is just as important, nay, more important, to see the sources used in compiling a family history web page. If sources are not provided, one should try to determine where the data came from and how it was digitized for inclusion on the Internet.

    Where Did It Come From and How Did It Get to the Internet?

    While most genealogical researchers think that compiled family history Internet pages are where the most information will be found, more time is actually spent using transcription sites or compiled databases. These differ from a compiled family history page in the manner of presentation. Compiled family history pages are similar to the narrative style of the Register or the National Genealogical Society Quarterly style of family history reports found in books and periodicals. Many other websites offer transcribed data from census records, probate records, indexes to vital records, and so forth. Determining how that information has been digitized is the key to effective searching online and separating fact from fiction.

    Generally, information materializes online in one of three ways:

    • Manual entry
    • Optical Character Recognition (OCR)
    • Digitized Images

    Each method offers both pros and cons pertaining to accuracy and ease of use.

    Manual Entry

    Many volunteer organizations rely on this method to add information to the Internet. Some groups have many volunteers, while others may have just a handful. These busy individuals will take information found either on microfilm or photocopies and transcribe or abstract it, creating a file that can then be uploaded to the Internet.

    As with any transcription or abstraction project, the human element leaves room for error. The transcribed records are at least one generation further removed from the original documents, and each additional generation offers more chances for errors to creep in. Typical results are misspelled names and transposed digits in dates. These errors are not intentional, just the result of human frailty.

    Depending on the project, it is possible that some of the errors will be caught and corrected during the editing phase. However, even with the best of editors, some errors will be overlooked, and thus make it into the final file as it is displayed on the Internet.

    The best projects will offer full source citation, allowing the researcher to return to the original documents during the verification aspect of the research process. Even with the most reliable of sites, it is a good idea to verify the information with the original records, if available. Transcription and abstraction sites are good index resources that guide a researcher to the exact location in the original records. Thus, their importance is no less, as they play a key role in the research process.
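
    The verification step described above can be sketched in code. The following Python example is only an illustration: the function names classify_difference and verify_record, and the sample record fields, are hypothetical and do not come from any transcription project's tools. It compares values taken from an online transcription with values the researcher has re-read from the original record, and it notes when a disagreement looks like two adjacent characters swapped.

```python
def classify_difference(transcribed, original):
    """Describe how a transcribed value differs from the original value."""
    if transcribed == original:
        return "match"
    if len(transcribed) == len(original):
        # Positions where the two values disagree.
        diffs = [i for i in range(len(original)) if transcribed[i] != original[i]]
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and transcribed[diffs[0]] == original[diffs[1]]
                and transcribed[diffs[1]] == original[diffs[0]]):
            return "adjacent transposition"
    return "other difference"


def verify_record(transcribed, original):
    """Return the fields that do not match, with the kind of difference found."""
    problems = {}
    for field_name, original_value in original.items():
        result = classify_difference(transcribed.get(field_name, ""), original_value)
        if result != "match":
            problems[field_name] = result
    return problems


# Hypothetical example values, invented for illustration.
online_copy = {"surname": "Smtih", "birth_year": "1854", "town": "Salem"}
original_record = {"surname": "Smith", "birth_year": "1845", "town": "Salem"}
print(verify_record(online_copy, original_record))
```

    A check of this kind only helps once the original record has actually been consulted; it does not replace that step.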

    Most researchers today are sandwiching their family history research in amongst all the other duties required of them in a given week. Transcription and abstraction sites allow these researchers to make progress even when they cannot access the original records right away.

    Optical Character Recognition (OCR)

    Companies in the business of creating digitized files of previously published works use Optical Character Recognition programs. OCR software takes a typed page and converts the scanned image to a text file. Basically, OCR software is reading the page to the computer. Such a file can then be worked on in a word processing program, added to a database, or made available on the Internet either as a web page or in a searchable database. OCR software moves faster than the fastest typist, allowing more pages to be digitized in a shorter period.

    Given the crisp output of today's laser printers and high-quality inkjet printers, it can be difficult to see where this process could go wrong. However, most of the records run through OCR software are not that clear. Many have smudged ink or old printing-press type, and ink splotches can throw off the OCR software. Even clear pages can be misread if the document is not held flat on a flatbed scanner or is not fed evenly through a sheet-fed scanner.

    Companies rely on OCR software to process more pages than would be possible through the work of typists. While most companies have a quality assurance department responsible for checking the overall quality of the resulting digital files, errors can still creep in. More information can be processed in this manner than with manual entry, but there is still a limit to the number of pages a human can compare with originals.

    As with manual entry, companies offering OCR-generated files should include source citations that allow the researcher to take information found in the online database and compare it to the original record. There will be times when a researcher may not be able to get the original record. In such a case, it is necessary for the researcher to be open to the possibility that the information found in the online database may be in error.
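
    One practical way to stay open to OCR errors is to search for likely misreadings of a name as well as the name itself. The Python sketch below is an illustration under stated assumptions: the confusion pairs listed (such as "rn" read as "m") are examples chosen for demonstration, not a list taken from any OCR product, and the function name ocr_variants is hypothetical. Real misreadings depend on the typeface and the quality of the scan.

```python
# Illustrative confusion pairs only (an assumption, not a published list).
CONFUSIONS = [
    ("rn", "m"), ("m", "rn"),
    ("l", "1"), ("1", "l"),
    ("O", "0"), ("0", "O"),
    ("cl", "d"), ("d", "cl"),
]


def ocr_variants(text):
    """Return spellings that differ from text by one assumed OCR confusion."""
    variants = set()
    for read_from, read_as in CONFUSIONS:
        start = text.find(read_from)
        while start != -1:
            # Replace only this one occurrence, leaving the rest of the text alone.
            variants.add(text[:start] + read_as + text[start + len(read_from):])
            start = text.find(read_from, start + 1)
    return variants


print(sorted(ocr_variants("Barnes")))  # ['Bames']
print(sorted(ocr_variants("Adams")))   # ['Aclams', 'Adarns']
```

    Searching a database for these variants in addition to the correct spelling may turn up entries that a single exact search would miss.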

    Digitized Images

    A recent move in both online and CD-ROM products is the digitized image. This process uses a scanner just as the OCR process does. The difference is that instead of trying to convert the image to a text file, the image is stored as a graphical file. The benefit of this process is that companies (those who are most often able to afford to digitize) are not limited to only resources of the typed or published variety. Census records and other handwritten documents can be included through this process. Another benefit to this method is that instead of having to guess if errors have inadvertently been added to the file, the researcher can see the original document as though working with a microfilm on a microfilm reader.

    Two drawbacks to this method are the size of the files, which results in increased download time for the researcher, and the limitations in indexing. Digitized image files are much larger than text files. While a text file may load in sections, allowing the researcher to read as the rest is loading, images often must load completely before they are readable. Companies may still have to create the index manually, which introduces the same kinds of errors mentioned earlier. However, a researcher can generally scroll through the digitized image pages searching for an entry if that individual was not found in the manually created, and thus searchable, index.
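
    A rough calculation shows the scale of the size difference. Every figure in the Python sketch below is an assumption chosen for illustration: a letter-size page scanned at 300 dots per inch in 8-bit grayscale and stored uncompressed, compared with a page of about 3,000 one-byte text characters. Image compression reduces the image size considerably in practice, but the image generally remains far larger than the text.

```python
# All values are assumptions for illustration, not measurements.
PAGE_WIDTH_IN = 8.5      # letter-size page width in inches
PAGE_HEIGHT_IN = 11      # letter-size page height in inches
DPI = 300                # assumed scanning resolution
BYTES_PER_PIXEL = 1      # 8-bit grayscale, uncompressed
CHARS_PER_PAGE = 3000    # assumed text on one page, one byte per character


def uncompressed_image_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Size of an uncompressed scan: pixels wide times pixels high times bytes per pixel."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel


image_size = uncompressed_image_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, BYTES_PER_PIXEL)
text_size = CHARS_PER_PAGE

print(image_size)               # 8415000
print(image_size // text_size)  # 2805
```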

    This process is now used for books as well as handwritten documents. Nearly the entire run of the New England Historical and Genealogical Register has been added to this website using the digitized graphics method. The researcher looks at the pages just as though picking up a volume of the Register off the shelf at the library. The only errors included in such digitized images are those that were included in the original resource, whether it be a book, periodical, or census page.

    Telling the Difference

    There is virtually no way to tell the difference between data entered manually and data fed through an OCR program, whether online or on CD-ROM. However, most volunteer groups recognize the efforts of those who have spent the hours necessary in transcribing or abstracting the information made available in their databases. The USGenWeb sites are prime examples of work done using the manual entry method.

    Telling the difference between manually entered or OCR-generated text and digitized images is easy. Digitized images look just like a photocopy of the original, complete with smudges and irregularities in the type or printing process. Manually entered or OCR-created files will look like typed text, usually in one of the standard fonts like Times New Roman or Helvetica. They will also lack the page edge that is often visible in digitized image files. Of course, handwritten records are easy to identify, as they are always digitized images.

    However, the index used to search such records was created using manual entry. The same problems inherent with any index are likely to be present in this type of index. When researching a family name, the researcher is looking through each page of a handwritten document for a specific name to leap off the page. Indexers, on the other hand, must try to evaluate the handwriting to determine what surname or given name was actually written on the page.
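
    Because an indexer may have read a name differently than the researcher would, it can help to look for index entries that are close to the desired spelling rather than identical to it. The Python sketch below uses the standard library's difflib.get_close_matches for this. The function name candidate_index_entries, the sample index, and the cutoff of 0.8 are assumptions made for illustration; a real index would call for experimenting with the cutoff.

```python
import difflib


def candidate_index_entries(query, index_names, cutoff=0.8):
    """Return index entries whose spelling is close to the query, ignoring case."""
    by_lowercase = {}
    for name in index_names:
        by_lowercase.setdefault(name.lower(), []).append(name)
    close = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=10, cutoff=cutoff
    )
    # Map the lowercase matches back to the spellings used in the index.
    return [name for key in close for name in by_lowercase[key]]


# Hypothetical index entries, invented for illustration.
sample_index = ["McClure", "MacLure", "Moore", "Miller"]
print(candidate_index_entries("McLure", sample_index))
```

    With this sample index, a search for "McLure" returns both "McClure" and "MacLure" and leaves out "Moore" and "Miller". Each candidate still has to be checked against the page image itself.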

    Another way a researcher can distinguish between these different methods relates to how effective the index is. When the researcher types in a name, if the search takes the researcher to the exact entry, highlighting it, then the information was entered either manually or through OCR. The closest a researcher can get to an index in a digitized image file is the page where the entry can be found, requiring the researcher to then scan the page looking for the name of the desired individual.

    Compiled Family Histories

    Compiled family histories on the Internet take on many different forms. There are the narrative style reports generated from the many different genealogy software programs on the market today. There are also compiled database sites made up of GEDCOM files. Both types of sites are the result of research by fellow researchers. Their level of knowledge and experience in researching a family tree will range from novice to professional. Generally, there is no way to tell the level of experience unless the submitter of the GEDCOM file or the creator of the web page has cited sources.

    The biggest problem with compiled family histories, regardless of the final format made available online, is the origin of the research. Unfortunately, there is a disturbing trend where researchers are downloading GEDCOM files from sites such as RootsWeb's WorldConnect or the Family History Library's Ancestral File or Pedigree Resource File and simply merging that data into their own family history databases. They then export the unverified information as a GEDCOM file or a narrative genealogy report and put it back out onto the Internet. The result of this process is the propagation of misinformation.

    Because it is so easily accessible on the Internet, this misinformation spreads more quickly than it would through published books. The many search engines available make it easier to find, and the cycle is almost never-ending as researchers download and then upload the same information.
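
    Before merging a downloaded GEDCOM file, a researcher can at least check which individuals carry no source citation at all. The Python sketch below is a minimal illustration, not a full GEDCOM parser: it assumes the usual line layout of level number, optional cross-reference, tag, and value, and it treats any SOUR line inside an INDI record as a citation. The function name individuals_without_sources and the sample data are hypothetical.

```python
def individuals_without_sources(gedcom_text):
    """Return (xref, name) pairs for INDI records that contain no SOUR line."""
    unsourced = []
    current = None  # [xref, name, has_source] for the INDI record being read

    def close_record():
        if current is not None and not current[2]:
            unsourced.append((current[0], current[1]))

    for raw_line in gedcom_text.splitlines():
        parts = raw_line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        if parts[0] == "0":
            # A level-0 line ends the previous record and starts a new one.
            close_record()
            is_individual = len(parts) == 3 and parts[2] == "INDI"
            current = [parts[1], None, False] if is_individual else None
        elif current is not None:
            tag = parts[1]
            if tag == "NAME" and current[1] is None and len(parts) == 3:
                current[1] = parts[2].replace("/", "").strip()
            elif tag == "SOUR":
                current[2] = True
    close_record()
    return unsourced


# Hypothetical sample file, invented for illustration.
SAMPLE = """0 HEAD
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 12 MAR 1801
2 SOUR @S1@
0 @I2@ INDI
1 NAME Mary /Jones/
1 BIRT
2 DATE ABT 1805
0 @S1@ SOUR
1 TITL Town vital records
0 TRLR
"""
print(individuals_without_sources(SAMPLE))  # [('@I2@', 'Mary Jones')]
```

    The presence of a SOUR line is only a first filter. A citation that points to another person's GEDCOM file or web page still leaves the information unverified.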

    In Conclusion

    The Internet is a resource that offers researchers the ability to make progress with their family history research despite the limitations of the operational hours of libraries. Unfortunately, many researchers new to this hobby are limiting their research to just the Internet. While computers are marvelous tools, most of the information found online has been manipulated in some way by humans and generally should be verified with original, primary documents whenever possible. As researchers, it is important to demand source citations regardless of the individual or company making the data available. Without the source citation, the information is nearly impossible to verify and it defeats the purpose of sharing it, regardless of the final output used online.

    Before a researcher incorporates any information found online, whether downloaded via a GEDCOM file, or obtained while reading the information from a family history web site, he or she should keep such information separate from a personal database. It is much easier to add information to a database later, after verifying its accuracy, than it is to try to delete individuals who later turn out not to be related.

    It is important to remember that a source is any record, index, CD-ROM, or web site where information was found. Researchers are always told to cite sources, but unfortunately, few are diligent about doing so. With the Internet, it is all the more important that information found by a researcher be traceable to its source. No sources? The information is suspect and should be quarantined to a separate database until verified.
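
    The quarantine approach can be sketched as two separate lists, with information moving from one to the other only after an original record has been consulted. The Python example below is an illustration only: the class names Claim and ResearchFiles, their fields, and the sample entries are hypothetical and do not describe any genealogy program.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    person: str
    fact: str
    found_at: str  # the web site, GEDCOM file, or index where the claim was found
    verified_against: list = field(default_factory=list)  # original records consulted


class ResearchFiles:
    def __init__(self):
        self.quarantine = []          # unverified online finds
        self.personal_database = []   # claims checked against original records

    def add_online_find(self, claim):
        """Everything found online starts out in quarantine."""
        self.quarantine.append(claim)

    def promote_verified(self):
        """Move claims with at least one original record into the personal database."""
        promoted = [c for c in self.quarantine if c.verified_against]
        self.quarantine = [c for c in self.quarantine if not c.verified_against]
        self.personal_database.extend(promoted)
        return promoted


# Hypothetical entries, invented for illustration.
files = ResearchFiles()
birth = Claim("Mary Jones", "born about 1805", "downloaded GEDCOM file")
marriage = Claim("Mary Jones", "married John Smith", "family history web page")
files.add_online_find(birth)
files.add_online_find(marriage)

birth.verified_against.append("town vital records, viewed on microfilm")
files.promote_verified()

print(len(files.personal_database), len(files.quarantine))  # 1 1
```

    Recording where each claim was found, even while it sits in quarantine, keeps the source citation attached to the information from the start.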

    Even when sources are cited, they are often simply another person's GEDCOM file or web page. No further research has been undertaken before making the information available on yet another web page or through another GEDCOM file. As a result, there is a lot of fiction currently found out on the Internet under the guise of family history fact.

New England Historic Genealogical Society
99 - 101 Newbury Street
Boston, Massachusetts 02116, USA

© 2010 - 2014 New England Historic Genealogical Society