Diff multiple GEDCOMs #323

Open
Sternbach-Software opened this issue Oct 31, 2022 · 7 comments

Comments

@Sternbach-Software

Sternbach-Software commented Oct 31, 2022

Over time, I have accumulated multiple trees across multiple platforms, and they have regrettably become out of sync. I want to consolidate them all into one tree, but that means diffing multiple trees, something gedcomdiff doesn't currently support.

It would be fantastic if I could diff multiple GEDCOMs and display the result in a single HTML page. The coloring would show whether a field is missing from one file or added to/missing from the rest. It would be cool to see, for a given field, which files have it and which are missing it. Is this feasible?

Sample code (not a Go expert, but I think this works):

func runDiffCommand() {
	...
	var gedcoms = []gedcom.IndividualNodes{leftIndividuals, rightIndividuals}
	var multiple map[*gedcom.IndividualNodes]map[*gedcom.IndividualNodes]gedcom.IndividualComparisons
	//var comparisons gedcom.IndividualComparisons
	go func() {
		//comparisons = leftIndividuals.Compare(rightIndividuals, compareOptions)
		multiple = compareMultiple(gedcoms, compareOptions)
	}()
	...
	page := html.NewDiffPageMultiple(multiple, filterFlags, optionGoogleAnalyticsID,
		optionShow, optionSort, diffProgress, compareOptions, html.LivingVisibilityShow)
	
	go func() {
		_, err = page.WriteHTMLTo(out)
		if err != nil {
			log.Fatal(err)
		}

		close(diffProgress)
	}()
}

func compareMultiple(gedcoms []gedcom.IndividualNodes, compareOptions *gedcom.IndividualNodesCompareOptions) map[*gedcom.IndividualNodes]map[*gedcom.IndividualNodes]gedcom.IndividualComparisons {
	// comparisons[x][y] is the diff of x with respect to y ("x.Compare(y)").
	var mapOfLeftToRightsComparisons = make(map[*gedcom.IndividualNodes]map[*gedcom.IndividualNodes]gedcom.IndividualComparisons)
	for i := range gedcoms {
		// Take the address of the slice element, not of a range variable,
		// so that every file gets its own distinct and stable map key.
		left := &gedcoms[i]
		var rightsToDiffs = make(map[*gedcom.IndividualNodes]gedcom.IndividualComparisons)
		for j := range gedcoms {
			if i == j {
				continue // don't compare the same gedcom with itself
			}
			right := &gedcoms[j]
			rightsToDiffs[right] = left.Compare(*right, compareOptions)
		}
		mapOfLeftToRightsComparisons[left] = rightsToDiffs
	}
	// Debug output: print every left/right pair together with its diff.
	for left, rightsToDiffs := range mapOfLeftToRightsComparisons {
		for right, diff := range rightsToDiffs {
			s := fmt.Sprint("Left (", left, ") compared to right (", right, "):", diff)
			fmt.Println(s)
		}
	}
	return mapOfLeftToRightsComparisons
}
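
A note on pointer-keyed maps in a sketch like this: before Go 1.22 a range loop reuses a single variable for every iteration, so the address of the loop variable is the same key each time, and comparing the addresses of two different loop variables is never true. The minimal sketch below sidesteps both problems by indexing into the slice. It uses plain strings as hypothetical stand-ins for gedcom.IndividualNodes, and pairKeys is an illustrative name, not part of the library.

```go
package main

import "fmt"

// pairKeys builds the same left -> right -> value shape as the sample code,
// using strings as hypothetical stand-ins for parsed GEDCOM files.
func pairKeys(files []string) map[*string]map[*string]string {
	pairs := make(map[*string]map[*string]string)
	for i := range files {
		left := &files[i] // address of the slice element: distinct and stable
		inner := make(map[*string]string)
		for j := range files {
			if i == j {
				continue // skip comparing a file with itself
			}
			right := &files[j]
			inner[right] = *left + " vs " + *right
		}
		pairs[left] = inner
	}
	return pairs
}

func main() {
	pairs := pairKeys([]string{"a.ged", "b.ged", "c.ged"})
	for left, inner := range pairs {
		fmt.Println(*left, "is compared against", len(inner), "other files")
	}
}
```

With three input files, every file ends up as its own outer key with two inner entries, which is the shape the multi-file diff page would need.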
@Sternbach-Software
Author

Sternbach-Software commented Oct 31, 2022

I'm wondering if the best you can do is remove definite duplicate fields or individuals.

@elliotchance
Owner

There are two things to explore here:

  1. Cartesian match. This would allow any number of GEDCOMs to be provided (rather than left and right). However, this would be extremely expensive to process because each additional file multiplies the number of comparisons required by the number of individuals in that file. I don't think this is practical in cases where a file contains more than even a few hundred individuals.
  2. Compare a primary against multiple others. I think this is the case you're referring to? That is, where you specify a left (primary) GEDCOM, but may specify multiple right GEDCOMs. This would certainly be more expensive, but linearly rather than exponentially.

Going with option 2, you may be able to test how that might work with something like:

-left-gedcom primary.gedcom -right-gedcom right1.gedcom -right-gedcom right2.gedcom

There might need to be some special options in this case, where the primary is always included and it just shows the closest match (if any) from each respective right GEDCOM.
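
As a rough illustration of option 2, the sketch below looks up the closest match for each primary individual in each right file. Everything in it is an assumption made for illustration: Person, similarity, and closestMatch are hypothetical names, and the toy score (half for an equal name, half for an equal birth year) is not the similarity measure the gedcom library actually uses.

```go
package main

import "fmt"

// Person is a hypothetical, simplified stand-in for a GEDCOM individual.
type Person struct {
	Name      string
	BirthYear int
}

// similarity is a toy score in [0, 1]; a real comparison would be far richer.
func similarity(a, b Person) float64 {
	score := 0.0
	if a.Name == b.Name {
		score += 0.5
	}
	if a.BirthYear == b.BirthYear {
		score += 0.5
	}
	return score
}

// closestMatch returns the index of the best-scoring candidate whose score
// is at least minScore, or -1 if no candidate qualifies.
func closestMatch(p Person, candidates []Person, minScore float64) int {
	best, bestScore := -1, 0.0
	for i, c := range candidates {
		s := similarity(p, c)
		if s >= minScore && s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	primary := []Person{{"Ann Lee", 1900}, {"Tom Roe", 1875}}
	rightNames := []string{"right1.gedcom", "right2.gedcom"}
	rights := [][]Person{
		{{"Ann Lee", 1900}, {"Sam Poe", 1910}},
		{{"Tom Roe", 1875}},
	}

	// The primary individual is always listed; each right file contributes
	// its closest match, if it has one.
	for _, p := range primary {
		fmt.Println(p.Name)
		for f, candidates := range rights {
			if i := closestMatch(p, candidates, 0.75); i >= 0 {
				fmt.Println("  ", rightNames[f], "->", candidates[i].Name)
			} else {
				fmt.Println("  ", rightNames[f], "-> no match")
			}
		}
	}
}
```

The work here grows with the size of the primary multiplied by the combined size of the right files, which is why adding another right file costs proportionally more rather than multiplying the total.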

@Sternbach-Software
Author

Sternbach-Software commented Nov 1, 2022

Option 1 is not feasible even with goroutines? It may take some time, but the user should know what they are getting into. I have some good hardware to spare and can let it run for a while (even for a few days honestly, but that is probably just me).

@Sternbach-Software
Author

Sternbach-Software commented Nov 1, 2022

In my case, I have many trees from Geni.com and Ancestry, and I don't know which one is most comprehensive. Some have a lot of information (not necessarily regarding overlapping parts of the tree) that the others don't, so I don't have a primary. By comparing each one I could possibly identify (or create) a primary, but it would take work.

@Sternbach-Software
Author

Sternbach-Software commented Nov 1, 2022

Alternatively, similar to the tune command, there could be a command that determines which tree makes the most sense as the primary and then diffs the others against it. Maybe the one that is missing the least? That may not be a good criterion, though: one tree could contain a lot of information that is incorrect, while the tree that should be the primary is a revised one with less information in it.
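
To make the "missing the least" idea concrete, here is a minimal sketch that picks the tree with the most populated fields. Tree, populatedFields, and pickPrimary are hypothetical names, and the map-based individual is a simplification rather than the gedcom library's data model. As noted above, this heuristic rewards quantity, not correctness.

```go
package main

import "fmt"

// Tree is a hypothetical, simplified stand-in for a parsed GEDCOM file:
// each individual is a map from a field name (e.g. "BIRT") to its value.
type Tree struct {
	Name        string
	Individuals []map[string]string
}

// populatedFields counts the non-empty field values across all individuals.
func populatedFields(t Tree) int {
	count := 0
	for _, individual := range t.Individuals {
		for _, value := range individual {
			if value != "" {
				count++
			}
		}
	}
	return count
}

// pickPrimary returns the index of the tree with the most populated fields,
// or -1 for an empty input. Ties go to the earliest tree.
func pickPrimary(trees []Tree) int {
	best, bestCount := -1, -1
	for i, t := range trees {
		if c := populatedFields(t); c > bestCount {
			best, bestCount = i, c
		}
	}
	return best
}

func main() {
	trees := []Tree{
		{Name: "geni.ged", Individuals: []map[string]string{
			{"NAME": "Ann Lee", "BIRT": "1900", "DEAT": "1980"},
		}},
		{Name: "ancestry.ged", Individuals: []map[string]string{
			{"NAME": "Ann Lee", "BIRT": ""},
			{"NAME": "Tom Roe", "BIRT": ""},
		}},
	}
	if i := pickPrimary(trees); i >= 0 {
		fmt.Println("candidate primary:", trees[i].Name)
	}
}
```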

@Sternbach-Software
Author

Alternatively, similar to the tune command, there could be a command that determines which tree makes the most sense as the primary and then diffs the others against it. Or, merge all of the "right" trees and compare the result to the left.

@elliotchance
Owner

The GEDCOM comparison can already make full use of multiple cores to speed it up, see -jobs:

https://github.com/elliotchance/gedcom/blob/master/cmd/gedcom/diff.go#L84-L86

And, perhaps even better, if it knows two individuals are the same (by an identifier) it can avoid the expensive comparison altogether.
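
The identifier shortcut can be pictured roughly as follows. This is only a sketch under stated assumptions, not the library's implementation: Person, UniqueID, and matchIndex are hypothetical names, and the expensive comparison is passed in as a function so that the number of calls to it can be counted.

```go
package main

import "fmt"

// Person is a hypothetical stand-in for an individual that may carry a
// unique identifier shared between files.
type Person struct {
	UniqueID string
	Name     string
}

// matchIndex first tries to match p by identifier, which is cheap. Only if
// that fails does it fall back to the expensive comparison.
func matchIndex(p Person, candidates []Person, expensive func(a, b Person) bool) int {
	if p.UniqueID != "" {
		for i, c := range candidates {
			if c.UniqueID == p.UniqueID {
				return i // known to be the same person, no scoring needed
			}
		}
	}
	for i, c := range candidates {
		if expensive(p, c) {
			return i
		}
	}
	return -1
}

func main() {
	calls := 0
	expensive := func(a, b Person) bool {
		calls++ // count how often the costly path is taken
		return a.Name == b.Name
	}

	candidates := []Person{{UniqueID: "", Name: "Tom Roe"}, {UniqueID: "X1", Name: "Anne Lee"}}
	index := matchIndex(Person{UniqueID: "X1", Name: "Ann Lee"}, candidates, expensive)
	fmt.Println("matched index:", index, "expensive calls:", calls)
}
```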

However, consider the numbers: Comparing two small trees of 1000 individuals takes 1 million comparisons (that's fine), but three trees of 1000 individuals requires 1 billion comparisons to be exhaustive (not fine). Trying to compare many trees (even if they are quite small) will exponentially increase the processing time required.
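
The arithmetic behind those figures can be checked with a short sketch. It follows the model used in this comment, where an exhaustive match considers every combination of one individual per file, and contrasts it with comparing a primary against each other file in turn. The function names are illustrative only.

```go
package main

import "fmt"

// exhaustiveComparisons multiplies the file sizes together: one comparison
// for every combination of one individual taken from each file.
func exhaustiveComparisons(sizes []int) int {
	total := 1
	for _, n := range sizes {
		total *= n
	}
	return total
}

// primaryComparisons compares the primary against each other file in turn,
// so the cost is the sum of primary*n over the other files.
func primaryComparisons(primary int, others []int) int {
	total := 0
	for _, n := range others {
		total += primary * n
	}
	return total
}

func main() {
	fmt.Println(exhaustiveComparisons([]int{1000, 1000}))       // 1000000
	fmt.Println(exhaustiveComparisons([]int{1000, 1000, 1000})) // 1000000000
	fmt.Println(primaryComparisons(1000, []int{1000, 1000}))    // 2000000
}
```

Under this model, a primary compared against two other files of 1000 individuals needs 2 million comparisons, against 1 billion for the exhaustive three-file match.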

Depending on what your goal is, it probably makes more sense to just choose a primary file and have everything work against that.
