Genealogy database design overview

The genetic basis of genealogy

Barry Leadbeater

When designing any database, relational or not, the most basic requirement is that it models the intended real-world situation accurately. One test of this is that the data is subject to no more restrictions than exist in real life. So, if a genealogy database allows the entry of no more than, say, X children per couple, or Y marriages per person, then the database design is flawed because no such restrictions exist in life. Nevertheless, some genealogy database programs do impose this sort of restriction. They often model a person and their family as something like

            person = spouse                     (1)
      ___________________ _ _ _ _ _ _____
     |             |                     |
  child 1       child 2               child X

This definition of a generalised family is inadequate because the number of children in actual families varies, as does the number of spouses a person has. The result is problems in both building and using the database.

GEDCOM data uses this model in the FAM records but, because of its free-form structure, places no limit on the number of children per family.

The fundamental database definition

It is preferable to base the model on the simplest possible definition of a person and their immediate family - a definition which applies to everyone without requiring adaptation. This is

            father = mother                     (2)
                   | sexual reproduction

To implement model (2) efficiently, a two-table relational database is required. One table contains all persons. The other contains all parent pairs. This second table can, of course, also contain childless partnerships, which are handled without modification. In this way it may do double duty as the marriage event table, but its basic purpose is to contain all couplings resulting in children. The two tables are linked: each person refers to the one parents record for their father and mother, and each parents record refers to two persons. Families are generated from the individuals by following these links, so all family relationships are automatically represented.
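The two-table design can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration, not a prescribed schema: the table names, the column names (parents_id, father_id, mother_id) and the sample people are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parents (
        parents_id INTEGER PRIMARY KEY,
        father_id  INTEGER,   -- person_id of the father, NULL if unknown
        mother_id  INTEGER    -- person_id of the mother, NULL if unknown
    );
    CREATE TABLE persons (
        person_id  INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        parents_id INTEGER REFERENCES parents(parents_id)  -- NULL if no parents record
    );
""")

# Invented sample data: Adam has children with two different partners.
conn.executemany("INSERT INTO parents VALUES (?, ?, ?)", [
    (1, 1, 2),  # Adam = Beth
    (2, 1, 3),  # Adam = Cora
])
conn.executemany("INSERT INTO persons VALUES (?, ?, ?)", [
    (1, "Adam", None), (2, "Beth", None), (3, "Cora", None),
    (4, "Dan", 1), (5, "Eve", 1), (6, "Finn", 2),
])

def children_of(parents_id):
    """Names of all persons who refer to the given parents record."""
    rows = conn.execute(
        "SELECT name FROM persons WHERE parents_id = ? ORDER BY person_id",
        (parents_id,),
    )
    return [name for (name,) in rows]

def partners_of(person_id):
    """person_ids of everyone paired with this person in the parents table."""
    rows = conn.execute(
        """SELECT CASE WHEN father_id = :p THEN mother_id ELSE father_id END
           FROM parents
           WHERE father_id = :p OR mother_id = :p
           ORDER BY parents_id""",
        {"p": person_id},
    )
    return [pid for (pid,) in rows]

print(children_of(1))   # ['Dan', 'Eve']
print(partners_of(1))   # [2, 3]
```

Nothing in this sketch limits the number of children per couple or partners per person: each additional child is one more row in persons, and each additional partnership is one more row in parents.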

GEDCOM data incorporates this model. The INDI records contain every individual person. The FAM records contain all marriages. But they also unnecessarily contain references to the children of each marriage. The result is a considerable amount of redundant reference data with the potential for conflicts and inconsistencies. That is, the GEDCOM model is fundamentally flawed and therefore should not be used as the basis for a genealogy database design. However, for the foreseeable future it will still be necessary to provide a means of converting data to and from the GEDCOM format, to enable export to, and import from, other GEDCOM-enabled databases.
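The kind of inconsistency this redundancy allows can be illustrated with a small check. The sketch below assumes the GEDCOM file has already been parsed into two plain dictionaries that keep only the pointer tags (FAMC and FAMS on individuals, HUSB, WIFE and CHIL on families). The dictionaries, the identifiers and the function name are invented for the example, and no real GEDCOM parser is used.

```python
# Hypothetical, already-parsed GEDCOM pointers (not produced by a real parser).
indi = {
    "@I1@": {"FAMS": ["@F1@"], "FAMC": []},
    "@I2@": {"FAMS": ["@F1@"], "FAMC": []},
    "@I3@": {"FAMS": [], "FAMC": ["@F1@"]},
    "@I4@": {"FAMS": [], "FAMC": ["@F1@"]},
}
fam = {
    # @I4@ says it is a child of @F1@, but @F1@ does not list it.
    "@F1@": {"HUSB": "@I1@", "WIFE": "@I2@", "CHIL": ["@I3@"]},
}

def child_link_conflicts(indi, fam):
    """Return (person, family) pairs recorded in only one of the two places."""
    from_indi = {(p, f) for p, rec in indi.items() for f in rec["FAMC"]}
    from_fam = {(p, f) for f, rec in fam.items() for p in rec["CHIL"]}
    return sorted(from_indi ^ from_fam)  # symmetric difference

conflicts = child_link_conflicts(indi, fam)
print(conflicts)  # [('@I4@', '@F1@')]
```

In the two-table design the child-to-parents link is stored once only, on the person record, so this class of conflict cannot arise.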

A good model should be able to cope with all the modern methods of producing children, including artificial insemination, implantation of the fertilised egg in the genetic mother or in a surrogate, and even cloning. Multiple spouses, whether at the same time or sequentially, should not cause any problem.

Coping with clones

Model (2) easily copes with all these situations with the one possible exception of cloning. In this case, the one and only parent provides both sets of chromosomes (and therefore genes) to the clone. Cloning of people is analogous to propagating plants from cuttings. It might seem that we can only use our model if we say that the one parent is half father and half mother: the father half of the parent provides one set of chromosomes and the mother half the other set. Applied this way, the model appears crude, since the simplest model of cloning seems to be the one-to-one relationship

                parent                         (3)
                   | asexual reproduction

However, genetically, the clone is a sibling rather than a child - equivalent to an identical twin. So, in fact, model (2) applies without modification and results in a family group like

            father = mother                    (4)
      ____________________________
     |             |              |
  child 1       child 2  →  child 2 clone
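As a minimal sketch of family group (4), the clone can be stored like any other child, referring to the same parents record as the person it was cloned from. The optional clone_of field below is an assumption added for the example to record the arrow in the diagram. It is not needed to find the clone's siblings.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    name: str
    parents_id: Optional[int] = None  # link to the parent pair, None if no record
    clone_of: Optional[str] = None    # hypothetical extra field, not part of model (2)

persons = [
    Person("child 1", parents_id=1),
    Person("child 2", parents_id=1),
    Person("child 2 clone", parents_id=1, clone_of="child 2"),
]

def siblings(persons, name):
    """Everyone else who refers to the same parents record as the named person."""
    me = next(p for p in persons if p.name == name)
    if me.parents_id is None:
        return []
    return [p.name for p in persons
            if p.parents_id == me.parents_id and p.name != name]

print(siblings(persons, "child 2 clone"))  # ['child 1', 'child 2']
```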

Coping with genetic engineering

Three-parent children, where the egg is supplied by one woman with its nucleus replaced by one from another woman's egg, are being proposed to avoid mitochondrially inherited diseases. This procedure would affect about 35 mitochondrial genes out of the total of about 30,000 (i.e. only about 0.1%). The present model would ignore the mitochondrial genes and therefore would not fully cope.

Then there's the possibility of artificially creating the sperm and egg from other cells of the parents. This would, amongst other possibilities, allow gay couples to have children genetically their own. Each of the parents could be father to some of their children and mother to the others! Experiments with mice have been successful.

Coping with unknown parents

Model (2) easily copes with one or two unknown parents. A person with both parents unknown always has a record in the persons table but there may not be any parents record. However, if there are known siblings, they each need to refer to the one parents record even though the parents' names are unknown. This parents record serves to link the individual persons as siblings.
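A minimal sketch of this case, again using sqlite3 with assumed table and column names and invented people: a parents record whose father and mother are both NULL still links two known siblings, while a person with no parents record at all simply has a NULL link.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parents (
        parents_id INTEGER PRIMARY KEY,
        father_id  INTEGER,   -- NULL when the father is unknown
        mother_id  INTEGER    -- NULL when the mother is unknown
    );
    CREATE TABLE persons (
        person_id  INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        parents_id INTEGER    -- NULL when there is no parents record
    );
    -- Both parents unknown, but the record still exists to link the siblings.
    INSERT INTO parents VALUES (1, NULL, NULL);
    INSERT INTO persons VALUES (1, 'Ann', 1);
    INSERT INTO persons VALUES (2, 'Ben', 1);
    INSERT INTO persons VALUES (3, 'Cy', NULL);
""")

def siblings_of(person_id):
    """Names of other persons sharing this person's parents record."""
    rows = conn.execute(
        """SELECT s.name
           FROM persons p
           JOIN persons s
             ON s.parents_id = p.parents_id AND s.person_id <> p.person_id
           WHERE p.person_id = ?
           ORDER BY s.person_id""",
        (person_id,),
    )
    return [name for (name,) in rows]

print(siblings_of(1))  # ['Ben']
print(siblings_of(3))  # []
```

Because NULL never compares equal to NULL in SQL, people without a parents record are not mistakenly treated as siblings of one another.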

Coping with adoption and fostering

Genetic parents (known or unknown) can be replaced by foster or adoptive parents if so desired, but the parents record will then no longer represent a genetic connection (unless the foster or adoptive parents are related to the genetic parents).

Expanding the design to a family history database


Family historians may wish to expand the basic genealogy database described above into a fully relational family history database. This requires, as a minimum, the addition of an events table recording the type, place, and time of each event. The marriage event type is the most important, as it will probably replace the parent pairs table described previously. The database user should also be able to add any other desired event types so that their own family history can be fully documented.
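One possible shape for such an events table is sketched below. Only the event type, place and time come from the description above. The event_persons link table, its role column, the ISO date strings and all the sample rows are assumptions made for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE persons (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE events (
        event_id   INTEGER PRIMARY KEY,
        event_type TEXT NOT NULL,  -- free text, so users can add their own types
        place      TEXT,
        event_date TEXT            -- ISO 8601 text, so dates sort correctly
    );
    CREATE TABLE event_persons (   -- hypothetical link table: who took part
        event_id  INTEGER REFERENCES events(event_id),
        person_id INTEGER REFERENCES persons(person_id),
        role      TEXT
    );
""")

# Invented sample data.
db.executemany("INSERT INTO persons VALUES (?, ?)", [(1, "John"), (2, "Mary")])
db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    (1, "marriage", "Adelaide", "1850-03-01"),
    (2, "arrival", "Port Adelaide", "1848-11-20"),
])
db.executemany("INSERT INTO event_persons VALUES (?, ?, ?)", [
    (1, 1, "husband"), (1, 2, "wife"), (2, 1, "passenger"),
])

def timeline(person_id):
    """All events a person took part in, earliest first."""
    rows = db.execute(
        """SELECT e.event_date, e.event_type, e.place
           FROM events e
           JOIN event_persons ep ON ep.event_id = e.event_id
           WHERE ep.person_id = ?
           ORDER BY e.event_date""",
        (person_id,),
    )
    return rows.fetchall()

print(timeline(1))
# [('1848-11-20', 'arrival', 'Port Adelaide'), ('1850-03-01', 'marriage', 'Adelaide')]
```

If the marriage event replaces the parent pairs table, each person's link to their parents would point at a marriage event_id instead of a parents_id.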


In the various tables, a field for the sources of the information is highly desirable, together with a sub-table of sources, preferably including an estimate of the confidence level of the information given in each source. Some family history database designers prefer to model the source material rather than the facts, so that the sources table is their fundamental database table. This simplifies the design for coping with multiple sources of the same data, especially when they conflict. However, such a design clearly models the paper trails, not the actual family history. Of course, it may be ideal to combine the two databases, with the "facts" database being derived from the "sources" database, taking into account the confidence level of each source.
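Deriving the "facts" from the "sources" could be sketched as follows. The rule used here, keeping the value from the most confident source for each fact, is only one simple possibility and is an assumption, as are the field names, the 0 to 1 confidence scale and the sample claims.

```python
# Invented sample claims; the confidence scale (0 to 1) is an assumption.
claims = [
    {"source": "parish register", "person": "Ann", "fact": "birth_year",
     "value": "1841", "confidence": 0.9},
    {"source": "census return", "person": "Ann", "fact": "birth_year",
     "value": "1842", "confidence": 0.6},
    {"source": "family letter", "person": "Ann", "fact": "birth_place",
     "value": "Adelaide", "confidence": 0.5},
]

def derive_facts(claims):
    """For each (person, fact), keep the value from the most confident source."""
    best = {}
    for claim in claims:
        key = (claim["person"], claim["fact"])
        if key not in best or claim["confidence"] > best[key]["confidence"]:
            best[key] = claim
    return {key: claim["value"] for key, claim in best.items()}

facts = derive_facts(claims)
print(facts[("Ann", "birth_year")])   # 1841
print(facts[("Ann", "birth_place")])  # Adelaide
```

The conflicting census claim is not discarded: it stays in the sources data, and only the derived facts reflect the choice between the two.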
