benchmarking spike 2: QA Spec for evaluation processors #123

krvoigt opened this issue Aug 24, 2022 · 11 comments

krvoigt commented Aug 24, 2022

Concept for benchmarking / data for the workflow tab

Data

In a discussion we identified the following properties of a workspace as crucial: publication date, font, layout, pages.

  • Publication date: The idea is to provide data sets from all VD periods as well as modern texts, to cover most of our known use cases.
  • Fonts: Antiqua and blackletter are the most common fonts in the VD periods. It would also be beneficial to have some Greek and maybe even Hebrew examples, since Greek and Hebrew are "holy languages" and are therefore likely to be quoted in some texts.
  • Layout: Layout considerations should encompass title pages (which have a lot of decoration, capitals, etc.), multi-column pages, tables, binding pages, vacant pages, maps, and sheet music.
  • Pages: The number of pages should range from 1 (leaflet) to 150 or even 300 pages (full monograph) to get an impression of a workflow's performance.

Metadata for data sets

  • In the JSON representation of a workflow, the data sets should be tagged to enable easy sorting/filtering (see the sketch below).
  • Each data set needs a thorough description so that users can compare their own data to the sample data sets used for a workflow.
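
A minimal sketch of what such a tagged data set could look like (the key names data_set, tags and description are placeholders, nothing is agreed upon yet):

{
    "data_set": "https://some-url-pointing-to.a/mets.xml",
    "tags": ["antiqua", "19th-century", "simple-layout"],
    "description": "A 19th-century monograph set in Antiqua with a simple single-column layout, ca. 100 pages."
}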

Next steps for data

  • create the data sets we want to use for workflows (re-using the ones I got from @kba)
  • create tags for them
  • write a first description (this step is important for the application's usability, but not pressing at this point)

Ground Truths

❓ At this point I'm not sure if we simply use an existing GT or create one ourselves.

Workflows

The main idea of the workflow tab is to enable OCR-D users to identify suitable workflows for their data (where suitability means CER/WER and/or performance of the workflow). Since we have a lot of processors, it's not feasible to simply permute all processors over all data sets. A good starting point might be to use the findings and recommendations of the KIT from the second project phase, combined with examples obtained from people using OCR-D on a daily basis (Maria?).

The first evaluation of the workflow results could be done with dinglehopper, which is suitable for simple text evaluation.

Next steps for workflows

  • re-do the evaluation done by the KIT with newer processor versions and check if CER/WER and/or performance changed (this doesn't seem feasible)
  • also consider newer processors in the evaluation
  • get in contact with Maria to talk about the workflows used on a day to day basis

Getting the data relevant for the front end

JSON Output

The dashboard should be fed with JSON containing all relevant information. A first draft of the data looks like this:

[
    {
        "workflow-id": "1",
        "ocrd-workspace": "https://some-url-pointing-to.a/mets.xml",
        "properties":
            {
                "font": "antiqua",
                "date-of-creation": "19. century",
                "no-of-pages": "100",
                "layout": "simple"
            },
        "workflow-metrics": "https://link-to-nextflow-results.com",
        "cer_total": "5.7",
        "cer_per_page": "0.92",
        "time_per_page_in_seconds": "15"        
    }
]

… and how to get it (maybe not in this spike)

In order to get a better understanding of how this is done, I will probably have to have a look at Nextflow and Mehmed's findings first.

Decide on metrics for text and layout and communicate them
Goal: measurability (is the OCR better or worse?)

  • agree on metrics for text and layout (the usual definitions are sketched below)
  • agree on the calculations that should be used
  • communicate the metrics and the calculations
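
For reference, a minimal sketch of the common edit-distance-based definitions (assuming we follow the usual calculation; the exact normalization is part of what we still have to agree on):

CER = (S + D + I) / N        # character substitutions, deletions and insertions,
                             # divided by the number of characters in the GT
WER = (Sw + Dw + Iw) / Nw    # the same, counted over words instead of characters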

Acceptance criteria (AC) for this sprint (Aug 30, 2022)

  • we have a concept for doing the actual benchmark and well-defined formats to express the results
  • the concept is accepted by the community
@krvoigt krvoigt added the Epic label Aug 24, 2022
@krvoigt krvoigt changed the title QA Spec for evaluation processors benchmarking spike 2: QA Spec for evaluation processors Aug 31, 2022
@mweidling (Collaborator)

@kba @paulpestov

Could you please give me some feedback regarding the JSON output suggested above?

@paulpestov

Thank you for your example of the workflow JSON. I have questions about the combination of different aspects that we discussed on the basis of the Workflow Tab mockup. There we have a matrix view of the workflows, grouped by benchmark types (in our case now "metrics"?) and models. As far as I understand it, you would return an array of workflow objects like the one above, right? How would we achieve that matrix view then? By which attributes should we group the workflows? How do the crucial properties that you describe above play together with the matrix view?
In addition to that, is it possible to integrate the Nextflow results into that JSON in order to have a more standardized response?

@mweidling (Collaborator)

As far as I understand it, you would return an array of workflow objects like the one above, right? How would we achieve that matrix view then? By which attributes should we group the workflows? How do the crucial properties that you describe above play together with the matrix view?

As far as I understood the draft, users should be able to select different benchmark metrics according to their needs. Therefore I chose a rather flat nesting for the relevant metrics. My understanding was that the front end takes care of sorting and displaying the workflows. Does that shift too much work to the web app?

In addition to that, is it possible to integrate the Nextflow results into that JSON in order to have a more standardized response?

I can add the relevant findings of Nextflow to the JSON output we produce instead of a URL, if that is what you mean. If not, could you elaborate?
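
For illustration, the workflow-metrics URL could then be replaced by an embedded object along these lines (the key names are placeholders; the actual fields depend on what Nextflow reports):

"workflow-metrics": {
    "wall_time_in_seconds": 1234,
    "cpu_percentage": 85,
    "memory_in_mib": 2048
}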


paulpestov commented Sep 6, 2022

My understanding was that the front end takes care of this sorting and displaying the workflows.

Yeah, so your array structure is fine. Also, we have a list view of workflows where the array response is also useful.

... users should be able to select different benchmark metrics according to their needs.

But here I'm not quite sure. What we thought of is to provide some filters; is that what you mean?
My questions were rather about the actual properties of the workflow array items, like how we represent the different combinations of workflow A, model A, metric A.

I can add the relevant findings of Nextflow to the JSON output we produce instead of a URL, if that is what you mean

Yes, that would be good.

@mweidling (Collaborator)

But here I'm not quite sure. What we thought of is to provide some filters; is that what you mean?

Yes.

My questions were rather about the actual properties of the workflow array items, like how we represent the different combinations of workflow A, model A, metric A.

I think we have to add two properties, the model and the workflow_steps, like so:

[
    {
        "workflow-id": "1",
        "workflow_steps":
            {
                "0": "Processor A",
                "1": "Processor B"
            },
        "ocrd-workspace": "https://some-url-pointing-to.a/mets.xml",
        "properties":
            {
                "font": "antiqua",
                "date-of-creation": "19. century",
                "no-of-pages": "100",
                "layout": "simple"
            },
        "wall_time": 1234,
        "cer_total": "5.7",
        "cer_per_page": "0.92",
        "time_per_page_in_seconds": 15       
    }
]

Is this what you mean? Or do you want several JSON files for each representation?

@paulpestov
Copy link

Is this what you mean? Or do you want several JSON files for each representation?

It might be an idea to have individual arrays for different things, like:

{
   "models": [
      {
           "id": 1,
           "name": "Model A"
      }
   ],
   "works": [
      {
           "id": 1,
           "name": "Work A"
      }
   ],
   "workflows": [
      {
           "id": 1,
           // ...
      }
   ]
}

Everything would be referenced by id then, so we can easily create filters out of it. Maybe also bookmark a certain filter by appending it to the URL, e.g. &work=1. What do you think?
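
A minimal sketch of how a workflow item could then reference the other arrays (model_id and work_id are hypothetical names):

"workflows": [
    {
        "id": 1,
        "model_id": 1,
        "work_id": 1,
        "cer_total": "5.7"
    }
]

A filter like &work=1 would then simply select all workflow items with work_id 1.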

The workflow steps are an additional topic, I think, but yes, we also need them. For me they are not that critical for creating the main functionalities of the front end. They just add some additional info about the workflow.

More importantly, I would consider (if our plans haven't changed) the basic rendering of the list and matrix views. So properties would be our work (like work A; maybe rename properties to work?), and here we could also use some title for the display. And yes, we would need some model attribute with an object as value, similar to properties. What info do we have about the model?

@mweidling (Collaborator)

After a discussion we came to the conclusion that we stick with the originally proposed format, but


mweidling commented Sep 9, 2022

[
  {
    "workflow_id": "wf1-data345-eval1",
    "label": "Workflow 1 on Data 345",
    "metadata": {
      "workflow": "https://example.org/workflow/1",
      "workflow_steps": {
        "0": "Processor A",
        "1": "Processor B"
      },
      "workflow_model": "Fraktur_GT4HistOCR",
      "eval_workflow": "https://example.org/workflow/eval1",
      "eval_data": "https://example.org/workspace/345",
      "eval_tool": "dinglehopper",
      "gt_data": "https://gt.ocr-d.de/workspace/789",
      "document": {
        "fonts": ["antiqua", "fraktur"],
        "publication_year": "19. century",
        "number_of_pages": "100",
        "layout": "simple"
      }
    },
    "evaluations": {
      "document_wide": {
        "wall_time": 1234,
        "cer": 0.57,
        "cer_min_max": [0.2, 0.57]
      },
      "by_page": [
        {
          "page_id": "PHYS_0001",
          "cer": 0.8,
          "processing_time": 2.1
        }
      ]
    }
  },
  {
    "workflow_id": "wf2-data345-eval1",
    "label": "Workflow 2 on Data 345",
    "metadata": {
      "workflow": "https://example.org/workflow/2",
      "workflow_steps": {
        "0": "Processor A",
        "1": "Processor B"
      },
      "workflow_model": "Fraktur_GT4HistOCR",
      "eval_workflow": "https://example.org/workflow/eval1",
      "eval_data": "https://example.org/workspace/345",
      "eval_tool": "dinglehopper",
      "gt_data": "https://gt.ocr-d.de/workspace/789",
      "document": {
        "fonts": ["antiqua", "fraktur"],
        "publication_year": "19. century",
        "number_of_pages": "100",
        "layout": "simple"
      }
    },
    "evaluations": {
      "document_wide": {
        "wall_time": 1234,
        "cer": 0.88,
        "cer_min_max": [0.2, 0.57]
      },
      "by_page": [
        {
          "page_id": "PHYS_0001",
          "cer": 0.9,
          "processing_time": 2.0
        }
      ]
    }
  }
]

@kba @paulpestov I tried to integrate Konstantin's proposal and also added some keys that might come in handy for the front-end app (e.g. workflow_model or eval_tool, which could be used to display more information in the workflow tab).

EDIT: I added the second example of ocrd_eval.sample.yml.
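
For validation, the shape above could additionally be captured in a JSON Schema along these lines (a minimal sketch only, not the actual ocrd_eval schema):

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "array",
    "items": {
        "type": "object",
        "required": ["workflow_id", "label", "metadata", "evaluations"],
        "properties": {
            "workflow_id": { "type": "string" },
            "label": { "type": "string" },
            "metadata": { "type": "object" },
            "evaluations": {
                "type": "object",
                "properties": {
                    "document_wide": { "type": "object" },
                    "by_page": { "type": "array", "items": { "type": "object" } }
                }
            }
        }
    }
}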


kba commented Sep 14, 2022

What is the difference between workflow_id at the top level and workflow in metadata?

I assume the former is the ID of the evaluation workflow and the latter of the data generation workflow? Since both are to be Nextflow scripts, they should both be addressable as URLs via the Web API. Maybe we can rename them eval_workflow and ocr_workflow?

@mweidling (Collaborator)

What is the difference between workflow_id at the top level and workflow in metadata?

I assume the former is the ID of the evaluation workflow and the latter of the data generation workflow? Since both are to be Nextflow scripts, they should both be addressable as URLs via the Web API. Maybe we can rename them eval_workflow and ocr_workflow?

I assumed that what you proposed in OCR-D/spec@master...qa-spec#diff-3ca00602cf767fb4a01ea3267035a87437cc087ccf4aecb252942431e9e1411bR1 should be the ID of a discrete evaluation workflow and just adopted the rest. :D So yeah, maybe we should be a bit more verbose with our keys.

@mweidling (Collaborator)

@kba What about this:

[
  {
    "eval_workflow_id": "wf1-data345-eval1",
    "label": "Workflow 1 on Data 345", // for UI display
    "metadata": {
      "data_creation_workflow": "https://example.org/workflow/1",
      "workflow_steps": {
        "0": "Processor A",
        "1": "Processor B"
      },
      "workflow_model": "Fraktur_GT4HistOCR", // for UI display
      "eval_workflow_url": "https://example.org/workflow/eval1",
      "eval_data": "https://example.org/workspace/345",
      "eval_tool": "dinglehopper",
      "gt_data": "https://gt.ocr-d.de/workspace/789",
      "data_properties": {
        "fonts": ["antiqua", "fraktur"],
        "publication_year": "19. century",
        "number_of_pages": "100",
        "layout": "simple"
      }
    },
    "evaluation_results": {
      "document_wide": {
        "wall_time": 1234,
        "cer": 0.57,
        "cer_min_max": [0.2, 0.57]
      },
      "by_page": [
        {
          "page_id": "PHYS_0001",
          "cer": 0.8,
          "processing_time": 2.1
        }
      ]
    }
  },
  {
    "eval_workflow_id": "wf2-data345-eval1",
    "label": "Workflow 2 on Data 345",
    "metadata": {
      "data_creation_workflow": "https://example.org/workflow/2",
      "workflow_steps": {
        "0": "Processor A",
        "1": "Processor B"
      },
      "workflow_model": "Fraktur_GT4HistOCR",
      "eval_workflow_url": "https://example.org/workflow/eval1",
      "eval_data": "https://example.org/workspace/345",
      "eval_tool": "dinglehopper",
      "gt_data": "https://gt.ocr-d.de/workspace/789",
      "data_properties": {
        "fonts": ["antiqua", "fraktur"],
        "publication_year": "19. century",
        "number_of_pages": "100",
        "layout": "simple"
      }
    },
    "evaluation_results": {
      "document_wide": {
        "wall_time": 1234,
        "cer": 0.88,
        "cer_min_max": [0.2, 0.57]
      },
      "by_page": [
        {
          "page_id": "PHYS_0001",
          "cer": 0.9,
          "processing_time": 2.0
        }
      ]
    }
  }
]

Changes:

  • workflow_id --> eval_workflow_id
  • workflow --> data_creation_workflow
  • document --> data_properties
  • evaluations --> evaluation_results

kba added a commit to OCR-D/spec that referenced this issue Sep 21, 2022