benchmarking spike 2: QA Spec for evaluation processors #123

krvoigt opened this issue Aug 24, 2022 · 11 comments

krvoigt commented Aug 24, 2022

Concept for benchmarking / data for the workflow tab

Data

In a discussion we identified the following properties of a workspace as crucial: publication date, font, layout, pages.

  • Publication date: The idea is to provide data sets from all VD periods as well as modern texts, to cover most of our known use cases.
  • Fonts: Antiqua and blackletter are the most common fonts in the VD periods. It would also be beneficial to have some Greek and maybe even Hebrew examples, since Greek and Hebrew are "holy languages" and are therefore likely to be quoted in some texts.
  • Layout: Layout considerations should encompass title pages (which have a lot of decoration, capitals, etc.), multi-column pages, tables, binding pages, vacant pages, maps, and sheet music.
  • Pages: The number of pages should range from 1 (leaflet) to 150 or even 300 pages (full monograph) to get an impression of a workflow's performance.

Metadata for data sets

  • In the JSON representation of a workflow, the data sets should be tagged to enable easy sorting/filtering (see the sketch below).
  • Each data set needs a thorough description so that users can compare their own data to the sample data sets used for a workflow.
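
A minimal sketch of what such a tagged data set could look like (the key names data_set, tags and description are placeholders, nothing is agreed upon yet):

{
    "data_set": "https://some-url-pointing-to.a/mets.xml",
    "tags": ["antiqua", "19th-century", "simple-layout"],
    "description": "A 19th-century monograph set in Antiqua with a simple single-column layout, ca. 100 pages."
}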

Next steps for data

  • create the data sets we want to use for workflows (re-using the ones I got from @kba)
  • create tags for them
  • write a first description (this step is important for the application's usability, but not pressing at this point)

Ground Truths

❓ At this point I'm not sure if we simply use an existing GT or create one ourselves.

Workflows

The main idea of the workflow tab is to enable OCR-D users to identify suitable workflows for their data (where suitability means CER/WER and/or performance of the workflow). Since we have a lot of processors, it's not feasible to simply permute all processors over all data sets. A good starting point might be to use the findings and recommendations of the KIT from the second project phase, combined with examples obtained from people using OCR-D on a daily basis (Maria?).

The first evaluation of the workflow results could be done with dinglehopper, which is suitable for simple text evaluation.

Next steps for workflows

  • re-do the evaluation done by the KIT with newer processor versions and check if CER/WER and/or performance changed (this doesn't seem feasible)
  • also consider newer processors in the evaluation
  • get in contact with Maria to talk about the workflows used on a day to day basis

Getting the data relevant for the front end

JSON Output

The dashboard should be fed with JSON containing all relevant information. A first draft of the data looks like this:

[
    {
        "workflow-id": "1",
        "ocrd-workspace": "https://some-url-pointing-to.a/mets.xml",
        "properties":
            {
                "font": "antiqua",
                "date-of-creation": "19. century",
                "no-of-pages": "100",
                "layout": "simple"
            },
        "workflow-metrics": "https://link-to-nextflow-results.com",
        "cer_total": "5.7",
        "cer_per_page": "0.92",
        "time_per_page_in_seconds": "15"        
    }
]

… and how to get it (maybe not in this spike)

In order to get a better understanding of how this is done, I will probably have to have a look at Nextflow and Mehmed's findings first.

Decide on metrics for text and layout and communicate them
Goal: measurability (is the OCR better or worse?)

  • agree on metrics for text and layout (the usual definitions are sketched below)
  • agree on the calculations that should be used
  • communicate the metrics and the calculations
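
For reference, a minimal sketch of the common edit-distance-based definitions (assuming we follow the usual calculation; the exact normalization is part of what we still have to agree on):

CER = (S + D + I) / N        # character substitutions, deletions and insertions,
                             # divided by the number of characters in the GT
WER = (Sw + Dw + Iw) / Nw    # the same, counted over words instead of characters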

Acceptance criteria (AC) for this sprint (Aug 30, 2022)

  • we have a concept for doing the actual benchmark and well-defined formats to express the results
  • the concept is accepted by the community
@krvoigt krvoigt added the Epic label Aug 24, 2022
@krvoigt krvoigt changed the title QA Spec for evaluation processors benchmarking spike 2: QA Spec for evaluation processors Aug 31, 2022
@mweidling (Collaborator)

@kba @paulpestov

Could you please give me some feedback regarding the JSON output suggested above?

@paulpestov

Thank you for your example of the workflow JSON. I have questions about the combination of different aspects that we discussed on the basis of the Workflow Tab mockup. There we have a matrix view of the workflows, grouped by benchmark types (in our case now "metrics"?) and models. As far as I understand it, you would return an array of workflow objects like the one above, right? How would we achieve that matrix view then? By which attributes should we group the workflows? How do the crucial properties that you describe above play together with the matrix view?
In addition to that, is it possible to integrate the Nextflow results into that JSON in order to have a more standardized response?

@mweidling (Collaborator)

As far as I understand it, you would return an array of workflow objects like the one above, right? How would we achieve that matrix view then? By which attributes should we group the workflows? How do the crucial properties that you describe above play together with the matrix view?

As far as I understood the draft, users should be able to select different benchmark metrics according to their needs. Therefore I chose a rather flat nesting for the relevant metrics. My understanding was that the front end takes care of sorting and displaying the workflows. Does that shift too much work to the web app?

In addition to that, is it possible to integrate the Nextflow results into that JSON in order to have a more standardized response?

I can add the relevant findings of Nextflow to the JSON output we produce instead of a URL, if that is what you mean. If not, could you elaborate?
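
For illustration, the workflow-metrics URL could then be replaced by an embedded object along these lines (the key names are placeholders; the actual fields depend on what Nextflow reports):

"workflow-metrics": {
    "wall_time_in_seconds": 1234,
    "cpu_percentage": 85,
    "memory_in_mib": 2048
}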


paulpestov commented Sep 6, 2022

My understanding was that the front end takes care of this sorting and displaying the workflows.

Yeah, so your array structure is fine. Also, we have a list view of workflows where the array response is also useful.

... users should be able to select different benchmark metrics according to their needs.

But here I'm not quite sure. What we thought of is to provide some filters; is that what you mean?
My questions were rather about the actual properties of the workflow array items, like how we represent the different combinations of workflow A, model A, metric A.

I can add the relevant findings of Nextflow to the JSON output we produce instead of a URL, if that is what you mean

Yes, that would be good.

@mweidling (Collaborator)

But here I'm not quite sure. What we thought of is to provide some filters; is that what you mean?

Yes.

My questions were rather about the actual properties of the workflow array items, like how we represent the different combinations of workflow A, model A, metric A.

I think we have to add two properties, the model and the workflow_steps, like so:

[
    {
        "workflow-id": "1",
        "workflow_steps":
            {
                "0": "Processor A",
                "1": "Processor B"
            },
        "ocrd-workspace": "https://some-url-pointing-to.a/mets.xml",
        "properties":
            {
                "font": "antiqua",
                "date-of-creation": "19. century",
                "no-of-pages": "100",
                "layout": "simple"
            },
        "wall_time": 1234,
        "cer_total": "5.7",
        "cer_per_page": "0.92",
        "time_per_page_in_seconds": 15       
    }
]

Is this what you mean? Or do you want several JSON files for each representation?

@paulpestov
Copy link

Is this what you mean? Or do you want several JSON files for each representation?

It might be an idea to have individual arrays for different things, like:

{
   "models": [
      {
           "id": 1,
           "name": "Model A"
      }
   ],
   "works": [
      {
           "id": 1,
           "name": "Work A"
      }
   ],
   "workflows": [
      {
           "id": 1,
           // ...
      }
   ]
}

Everything would be referenced by id then, so we can easily create filters out of it. Maybe also bookmark a certain filter by appending it to the URL, e.g. &work=1. What do you think?
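
A minimal sketch of how a workflow item could then reference the other arrays (model_id and work_id are hypothetical names):

"workflows": [
    {
        "id": 1,
        "model_id": 1,
        "work_id": 1,
        "cer_total": "5.7"
    }
]

A filter like &work=1 would then simply select all workflow items with work_id 1.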

The workflow steps are an additional topic, I think, but yes, we also need them. For me they are not that critical for creating the main functionalities of the front end. They just add some additional info about the workflow.

More importantly, I would consider (if our plans haven't changed) the basic rendering of the list and matrix views. So properties would be our work (like work A; maybe rename properties to work?), and here we could also use some title for the display. And yes, we would need some model attribute with an object as value, similar to properties. What info do we have about the model?

@mweidling (Collaborator)

After a discussion we came to the conclusion that we stick with the originally proposed format, but


mweidling commented Sep 9, 2022

[
  {
    "workflow_id": "wf1-data345-eval1",
    "label": "Workflow 1 on Data 345",
    "metadata": {
      "workflow": "https://example.org/workflow/1",
      "workflow_steps": {
        "0": "Processor A",
        "1": "Processor B"
      },
      "workflow_model": "Fraktur_GT4HistOCR",
      "eval_workflow": "https://example.org/workflow/eval1",
      "eval_data": "https://example.org/workspace/345",
      "eval_tool": "dinglehopper",
      "gt_data": "https://gt.ocr-d.de/workspace/789",
      "document": {
        "fonts": ["antiqua", "fraktur"],
        "publication_year": "19. century",
        "number_of_pages": "100",
        "layout": "simple"
      }
    },
    "evaluations": {
      "document_wide": {
        "wall_time": 1234,
        "cer": 0.57,
        "cer_min_max": [0.2, 0.57]
      },
      "by_page": [
        {
          "page_id": "PHYS_0001",
          "cer": 0.8,
          "processing_time": 2.1
        }
      ]
    }
  },
  {
    "workflow_id": "wf2-data345-eval1",
    "label": "Workflow 2 on Data 345",
    "metadata": {
      "workflow": "https://example.org/workflow/2",
      "workflow_steps": {
        "0": "Processor A",
        "1": "Processor B"
      },
      "workflow_model": "Fraktur_GT4HistOCR",
      "eval_workflow": "https://example.org/workflow/eval1",
      "eval_data": "https://example.org/workspace/345",
      "eval_tool": "dinglehopper",
      "gt_data": "https://gt.ocr-d.de/workspace/789",
      "document": {
        "fonts": ["antiqua", "fraktur"],
        "publication_year": "19. century",
        "number_of_pages": "100",
        "layout": "simple"
      }
    },
    "evaluations": {
      "document_wide": {
        "wall_time": 1234,
        "cer": 0.88,
        "cer_min_max": [0.2, 0.57]
      },
      "by_page": [
        {
          "page_id": "PHYS_0001",
          "cer": 0.9,
          "processing_time": 2.0
        }
      ]
    }
  }
]

@kba @paulpestov I tried to integrate Konstantin's proposal and also added some keys that might come in handy for the front-end app (e.g. workflow_model or eval_tool, which could be used to display more information in the workflow tab).

EDIT: I added the second example of ocrd_eval.sample.yml.
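
For validation, the shape above could additionally be captured in a JSON Schema along these lines (a minimal sketch only, not the actual ocrd_eval schema):

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "array",
    "items": {
        "type": "object",
        "required": ["workflow_id", "label", "metadata", "evaluations"],
        "properties": {
            "workflow_id": { "type": "string" },
            "label": { "type": "string" },
            "metadata": { "type": "object" },
            "evaluations": {
                "type": "object",
                "properties": {
                    "document_wide": { "type": "object" },
                    "by_page": { "type": "array", "items": { "type": "object" } }
                }
            }
        }
    }
}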


kba commented Sep 14, 2022

What is the difference between workflow_id at the top level and workflow in metadata?

I assume the former is the ID of the evaluation workflow and the latter of the data generation workflow? Since both are to be Nextflow scripts, they should both be addressable as URLs via the Web API. Maybe we can rename them eval_workflow and ocr_workflow?

@mweidling (Collaborator)

What is the difference between workflow_id at the top level and workflow in metadata?

I assume the former is the ID of the evaluation workflow and the latter of the data generation workflow? Since both are to be Nextflow scripts, they should both be addressable as URLs via the Web API. Maybe we can rename them eval_workflow and ocr_workflow?

I assumed that what you proposed in OCR-D/spec@master...qa-spec#diff-3ca00602cf767fb4a01ea3267035a87437cc087ccf4aecb252942431e9e1411bR1 should be the ID of a discrete evaluation workflow and just adopted the rest. :D So yeah, maybe we should be a bit more verbose with our keys.

@mweidling (Collaborator)

@kba What about this:

[
  {
    "eval_workflow_id": "wf1-data345-eval1",
    "label": "Workflow 1 on Data 345", // for UI display
    "metadata": {
      "data_creation_workflow": "https://example.org/workflow/1",
      "workflow_steps": {
        "0": "Processor A",
        "1": "Processor B"
      },
      "workflow_model": "Fraktur_GT4HistOCR", // for UI display
      "eval_workflow_url": "https://example.org/workflow/eval1",
      "eval_data": "https://example.org/workspace/345",
      "eval_tool": "dinglehopper",
      "gt_data": "https://gt.ocr-d.de/workspace/789",
      "data_properties": {
        "fonts": ["antiqua", "fraktur"],
        "publication_year": "19. century",
        "number_of_pages": "100",
        "layout": "simple"
      }
    },
    "evaluation_results": {
      "document_wide": {
        "wall_time": 1234,
        "cer": 0.57,
        "cer_min_max": [0.2, 0.57]
      },
      "by_page": [
        {
          "page_id": "PHYS_0001",
          "cer": 0.8,
          "processing_time": 2.1
        }
      ]
    }
  },
  {
    "eval_workflow_id": "wf2-data345-eval1",
    "label": "Workflow 2 on Data 345",
    "metadata": {
      "data_creation_workflow": "https://example.org/workflow/2",
      "workflow_steps": {
        "0": "Processor A",
        "1": "Processor B"
      },
      "workflow_model": "Fraktur_GT4HistOCR",
      "eval_workflow_url": "https://example.org/workflow/eval1",
      "eval_data": "https://example.org/workspace/345",
      "eval_tool": "dinglehopper",
      "gt_data": "https://gt.ocr-d.de/workspace/789",
      "data_properties": {
        "fonts": ["antiqua", "fraktur"],
        "publication_year": "19. century",
        "number_of_pages": "100",
        "layout": "simple"
      }
    },
    "evaluation_results": {
      "document_wide": {
        "wall_time": 1234,
        "cer": 0.88,
        "cer_min_max": [0.2, 0.57]
      },
      "by_page": [
        {
          "page_id": "PHYS_0001",
          "cer": 0.9,
          "processing_time": 2.0
        }
      ]
    }
  }
]

Changes:

  • workflow_id --> eval_workflow_id
  • workflow --> data_creation_workflow
  • document --> data_properties
  • evaluations --> evaluation_results

kba added a commit to OCR-D/spec that referenced this issue Sep 21, 2022