Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReadingOrder broken for recursive groups #15

Closed
bertsky opened this issue May 4, 2020 · 10 comments
Closed

ReadingOrder broken for recursive groups #15

bertsky opened this issue May 4, 2020 · 10 comments

Comments

@bertsky
Copy link

bertsky commented May 4, 2020

I struggle getting PageViewer to correctly visualise a correct but recursive reading order.

Example:

Let's start off with a page without region recursion:

PAGE-XML snippet

        <pc:ReadingOrder>
            <pc:OrderedGroup id="reading-order">
                <pc:RegionRefIndexed index="0" regionRef="Gutachten2-2_region0001"/>
                <pc:RegionRefIndexed index="1" regionRef="region0009"/>
                <pc:RegionRefIndexed index="2" regionRef="Gutachten2-2_region0002"/>
                <pc:RegionRefIndexed index="3" regionRef="Gutachten2-2_region0003"/>
                <pc:RegionRefIndexed index="4" regionRef="Gutachten2-2_region0004"/>
                <pc:RegionRefIndexed index="5" regionRef="Gutachten2-2_region0005"/>
                <pc:RegionRefIndexed index="6" regionRef="Gutachten2-2_region0006"/>
                <pc:RegionRefIndexed index="7" regionRef="Gutachten2-2_region0007"/>
                <pc:RegionRefIndexed index="8" regionRef="Gutachten2-2_region0008"/>
                <pc:RegionRefIndexed index="9" regionRef="Gutachten2-2_region0009"/>
                <pc:RegionRefIndexed index="10" regionRef="region0022"/>
                <pc:RegionRefIndexed index="11" regionRef="Gutachten2-2_region0010"/>
                <pc:RegionRefIndexed index="12" regionRef="Gutachten2-2_region0011"/>
                <pc:RegionRefIndexed index="13" regionRef="Gutachten2-2_region0012"/>
                <pc:RegionRefIndexed index="14" regionRef="region0030"/>
                <pc:RegionRefIndexed index="15" regionRef="Gutachten2-2_region0013"/>
                <pc:RegionRefIndexed index="16" regionRef="Gutachten2-2_region0014"/>
            </pc:OrderedGroup>
        </pc:ReadingOrder>

Gutachten2-2 xml

ReadingOrder contains a single OrderedGroup with a flat list of RegionRefIndexed, including tables without further structure. PageViewer correctly displays the order.

Now let's add some TextRegion cells to the tables, and properly include them in ReadingOrder: each RegionRefIndexed becomes an OrderedGroupIndexed of the same @regionRef and @index:

PAGE-XML snippet

        <pc:ReadingOrder>
            <pc:OrderedGroup id="reading-order">
                <pc:RegionRefIndexed index="0" regionRef="Gutachten2-2_region0001"/>
                <pc:RegionRefIndexed index="2" regionRef="Gutachten2-2_region0002"/>
                <pc:RegionRefIndexed index="3" regionRef="Gutachten2-2_region0003"/>
                <pc:RegionRefIndexed index="4" regionRef="Gutachten2-2_region0004"/>
                <pc:RegionRefIndexed index="5" regionRef="Gutachten2-2_region0005"/>
                <pc:RegionRefIndexed index="6" regionRef="Gutachten2-2_region0006"/>
                <pc:RegionRefIndexed index="7" regionRef="Gutachten2-2_region0007"/>
                <pc:RegionRefIndexed index="8" regionRef="Gutachten2-2_region0008"/>
                <pc:RegionRefIndexed index="9" regionRef="Gutachten2-2_region0009"/>
                <pc:RegionRefIndexed index="11" regionRef="Gutachten2-2_region0010"/>
                <pc:RegionRefIndexed index="12" regionRef="Gutachten2-2_region0011"/>
                <pc:RegionRefIndexed index="13" regionRef="Gutachten2-2_region0012"/>
                <pc:RegionRefIndexed index="15" regionRef="Gutachten2-2_region0013"/>
                <pc:RegionRefIndexed index="16" regionRef="Gutachten2-2_region0014"/>
                <pc:OrderedGroupIndexed id="region0009_group" regionRef="region0009" index="1">
                    <pc:RegionRefIndexed index="0" regionRef="region0009_region0001"/>
                    <pc:RegionRefIndexed index="1" regionRef="region0009_region0002"/>
                    <pc:RegionRefIndexed index="2" regionRef="region0009_region0003"/>
                    <pc:RegionRefIndexed index="3" regionRef="region0009_region0004"/>
                    <pc:RegionRefIndexed index="4" regionRef="region0009_region0005"/>
                    <pc:RegionRefIndexed index="5" regionRef="region0009_region0006"/>
                    <pc:RegionRefIndexed index="6" regionRef="region0009_region0007"/>
                    <pc:RegionRefIndexed index="7" regionRef="region0009_region0008"/>
                </pc:OrderedGroupIndexed>
                <pc:OrderedGroupIndexed id="region0022_group" regionRef="region0022" index="10">
                    <pc:RegionRefIndexed index="0" regionRef="region0022_region0001"/>
                    <pc:RegionRefIndexed index="1" regionRef="region0022_region0002"/>
                    <pc:RegionRefIndexed index="2" regionRef="region0022_region0003"/>
                    <pc:RegionRefIndexed index="3" regionRef="region0022_region0004"/>
                    <pc:RegionRefIndexed index="4" regionRef="region0022_region0005"/>
                    <pc:RegionRefIndexed index="5" regionRef="region0022_region0006"/>
                </pc:OrderedGroupIndexed>
                <pc:OrderedGroupIndexed id="region0030_group" regionRef="region0030" index="14">
                    <pc:RegionRefIndexed index="0" regionRef="region0030_region0001"/>
                    <pc:RegionRefIndexed index="1" regionRef="region0030_region0002"/>
                    <pc:RegionRefIndexed index="2" regionRef="region0030_region0003"/>
                    <pc:RegionRefIndexed index="3" regionRef="region0030_region0004"/>
                    <pc:RegionRefIndexed index="4" regionRef="region0030_region0005"/>
                    <pc:RegionRefIndexed index="5" regionRef="region0030_region0006"/>
                </pc:OrderedGroupIndexed>
            </pc:OrderedGroup>
        </pc:ReadingOrder>

Gutachten2-2 RO-groups-by-type xml

As you can see, PageViewer ignores the recursive group entries (for the tables) in the arrow path, but seems to add them at the end with position 0,0.

So, maybe PageViewer does not like the XML ordering by type, but wants it sorted by @index?

PAGE-XML snippet

        <pc:ReadingOrder>
            <pc:OrderedGroup id="reading-order">
                <pc:RegionRefIndexed index="0" regionRef="Gutachten2-2_region0001"/>
                <pc:OrderedGroupIndexed id="region0009_group" regionRef="region0009" index="1">
                    <pc:RegionRefIndexed index="0" regionRef="region0009_region0001"/>
                    <pc:RegionRefIndexed index="1" regionRef="region0009_region0002"/>
                    <pc:RegionRefIndexed index="2" regionRef="region0009_region0003"/>
                    <pc:RegionRefIndexed index="3" regionRef="region0009_region0004"/>
                    <pc:RegionRefIndexed index="4" regionRef="region0009_region0005"/>
                    <pc:RegionRefIndexed index="5" regionRef="region0009_region0006"/>
                    <pc:RegionRefIndexed index="6" regionRef="region0009_region0007"/>
                    <pc:RegionRefIndexed index="7" regionRef="region0009_region0008"/>
                </pc:OrderedGroupIndexed>
                <pc:RegionRefIndexed index="2" regionRef="Gutachten2-2_region0002"/>
                <pc:RegionRefIndexed index="3" regionRef="Gutachten2-2_region0003"/>
                <pc:RegionRefIndexed index="4" regionRef="Gutachten2-2_region0004"/>
                <pc:RegionRefIndexed index="5" regionRef="Gutachten2-2_region0005"/>
                <pc:RegionRefIndexed index="6" regionRef="Gutachten2-2_region0006"/>
                <pc:RegionRefIndexed index="7" regionRef="Gutachten2-2_region0007"/>
                <pc:RegionRefIndexed index="8" regionRef="Gutachten2-2_region0008"/>
                <pc:RegionRefIndexed index="9" regionRef="Gutachten2-2_region0009"/>
                <pc:OrderedGroupIndexed id="region0022_group" regionRef="region0022" index="10">
                    <pc:RegionRefIndexed index="0" regionRef="region0022_region0001"/>
                    <pc:RegionRefIndexed index="1" regionRef="region0022_region0002"/>
                    <pc:RegionRefIndexed index="2" regionRef="region0022_region0003"/>
                    <pc:RegionRefIndexed index="3" regionRef="region0022_region0004"/>
                    <pc:RegionRefIndexed index="4" regionRef="region0022_region0005"/>
                    <pc:RegionRefIndexed index="5" regionRef="region0022_region0006"/>
                </pc:OrderedGroupIndexed>
                <pc:RegionRefIndexed index="11" regionRef="Gutachten2-2_region0010"/>
                <pc:RegionRefIndexed index="12" regionRef="Gutachten2-2_region0011"/>
                <pc:RegionRefIndexed index="13" regionRef="Gutachten2-2_region0012"/>
                <pc:OrderedGroupIndexed id="region0030_group" regionRef="region0030" index="14">
                    <pc:RegionRefIndexed index="0" regionRef="region0030_region0001"/>
                    <pc:RegionRefIndexed index="1" regionRef="region0030_region0002"/>
                    <pc:RegionRefIndexed index="2" regionRef="region0030_region0003"/>
                    <pc:RegionRefIndexed index="3" regionRef="region0030_region0004"/>
                    <pc:RegionRefIndexed index="4" regionRef="region0030_region0005"/>
                    <pc:RegionRefIndexed index="5" regionRef="region0030_region0006"/>
                </pc:OrderedGroupIndexed>
                <pc:RegionRefIndexed index="15" regionRef="Gutachten2-2_region0013"/>
                <pc:RegionRefIndexed index="16" regionRef="Gutachten2-2_region0014"/>
            </pc:OrderedGroup>
        </pc:ReadingOrder>

Gutachten2-2 RO-groups-by-index xml

It seems that the order is better that way, but the position of the recursive group entries is still at 0,0.

How am I supposed to represent this?

Full file and image:
Gutachten2-2.zip

@chris1010010
Copy link
Contributor

The last XML snippet seems to be the correct way to do it.
There's a bug related to region nesting, similar to #13.
The renderer uses PageLayout.getRegion to determine coordinates of reading order groups (looking at region refs). But getRegion only returns root regions, not nested ones.

@chris1010010
Copy link
Contributor

Hi, I made a change, please test with the latest release.
C.

@bertsky
Copy link
Author

bertsky commented May 4, 2020

Hi, I made a change, please test with the latest release.
C.

I can confirm that with 1.4.06 everything works, including (a differently colored) arrow path within recursive groups. Thank you very much!

The last XML snippet seems to be the correct way to do it.

Okay, then maybe the documentation in the XSD should be amended to reflect that. Also, whether gaps in @index are allowed or not.

@kba I wonder how we can get the our generateDS object model to serialise the different member lists (RegionRefIndexed, OrderedGroupIndexed, UnorderedGroupIndexed) sorted by their (intermixed) @index...

@chris1010010
Copy link
Contributor

Thanks for the doc changes (the diff on GitHub was a bit messy and hard to read, but I trust you).
Index can be any ascending order, in our interpretation. The index is used only to determine the order of group members. Most XML handlers will be fine with the order of XML elements anyway, index is just there for correctness.

@bertsky
Copy link
Author

bertsky commented May 4, 2020

Thanks for the doc changes (the diff on GitHub was a bit messy and hard to read, but I trust you).

That's because I did a second commit where I reversed the order in which the two sets of group types get introduced – from indexed variants first, unindexed second to unindexed first, indexed second (because they are easier to understand and referenced on top-level ReadingOrder).

You should be fine looking at the commit diffs individually, the first commit in particular.

Index can be any ascending order, in our interpretation. The index is used only to determine the order of group members.

Well that's no small issue. In that case, contiguity is not required. Therefore, I should make another patch, replacing my statement about this by its inverse.

Most XML handlers will be fine with the order of XML elements anyway, index is just there for correctness.

Of course, we cannot enforce these details syntactically. But it is important how PRImA libs handle/expect this, being the reference implementation.

BTW we just have the unfortunate situation that the less strict interpretation of coordinate consistency in Aletheia causes our OCR-D validators to trigger tons of "false-positive" alarms about little polygon defects on annotations exported from Aletheia. (For example, small 1-pixel child-parent extrusions, rounding-related path self-intersections etc.) As soon as we can have a systematic account what sorts of errors can be automatically fixed, we will (make a post-processor and) report back.

@bertsky
Copy link
Author

bertsky commented May 4, 2020

Also, I wonder why the schema is so strict about the lower bounds of group size. That is, why is the subsequence choice of Un/OrderedGroup(Indexed) set to @minOccurs=1?

(It would make implementations much easier if groups could just be empty as well. Otherwise they have to be converted between groups and simple RegionRefs depending on what content some RO detector returned.)

@chris1010010 Give me a thumbs up if you would accept a PR making empty groups tolerable.

@chris1010010
Copy link
Contributor

@bertsky RegionRef for groups was added later, so empty groups did not make sense before that. But even now, I think we should keep that restriction. I'll ask the others, but I can already guess the response ;-)

@bertsky
Copy link
Author

bertsky commented May 6, 2020

RegionRef for groups was added later, so empty groups did not make sense before that.

I see.

Another question keeps swirling around in my head when I try to implement this in OCR-D Python: Are region IDs allowed to be referenced by multiple regionRefs (i.e. from different RegionRef(Indexed) or groups)? The XSD seems to be agnostic about that, and I have not found mention of the issue anywhere.

I am not asking whether there are already use-cases for this. (One can imagine people may want to have different "logics", like one purely page oriented including all footnotes and marginals, and another text-flow oriented only regarding the linear text body from page to page.) I just have to know whether this is allowed/intended to happen, and how PRImA's implementation handles this. (My current attempts always derive one global region ID dictionary from the recursive structure, which would of course fail if the answer was yes.)

@chris1010010
Copy link
Contributor

It's meant to be a reading order tree, therefore a region should only be referenced once.
Regarding implementation, do you mean when creating a reading order or parsing a reading order? I don't think the parser would complain. The GUI doesn't allow adding a region twice.
(I modified the order descriptions slightly)

@bertsky
Copy link
Author

bertsky commented May 6, 2020

It's meant to be a reading order tree, therefore a region should only be referenced once.

Understood. That makes it easier to implement. (See PRImA-Research-Lab/PAGE-XML#22)

Regarding implementation, do you mean when creating a reading order or parsing a reading order?

Both actually. And from my perspective the two tasks are not so different, because OCR-D processors always annotate incrementally, i.e. they have to parse and re-generate.

I don't think the parser would complain. The GUI doesn't allow adding a region twice.

Ok, that's good to know. So the parser and viewer are tolerant, but the generator and editor are strict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants