Extraction

Extraction features

All the Excel examples are located within the workbook called basic_examples.xlsx.

All the definitions are located within the extraction folder in the form of integration tests.

We will start with the simplest case, located in the sheet called simple spreadsheet: the way Excel spreadsheets should actually be used. There is only a single table here that needs to be parsed, so let's do that!

The whole workbook definition looks like this:

val definition = WorkbookParserDefinition(
    spreadsheetParserDefinitions = listOf(
        SheetParserDefinition(
            sheetNameFilter = { it == "simple spreadsheet" },
            tableDefinitions = listOf(
                TableParserDefinition(
                    requiredColumns = setOf(
                        StringHeaderCell("team"),
                        StringHeaderCell("plays"),
                        StringHeaderCell("wins"),
                        StringHeaderCell("goal diff"),
                        StringHeaderCell("points"),
                        StringHeaderCell("date")
                    )
                )
            )
        )
    )
)

Let's tackle this big chunk of code object-by-object now!

WorkbookParserDefinition is the entry point of all definitions. The most important thing to know for now is that it contains a list of spreadsheet parser definitions, and each spreadsheet parser definition contains a list of table parser definitions.

As we have multiple spreadsheets within the workbook, we applied sheetNameFilter to parse only the spreadsheet we need. This is a lambda function that is simply checked against the name of the spreadsheet.

The most interesting part for us here is the TableParserDefinition:

TableParserDefinition(
    requiredColumns = setOf(
        StringHeaderCell("team"),
        StringHeaderCell("plays"),
        StringHeaderCell("wins"),
        StringHeaderCell("goal diff"),
        StringHeaderCell("points"),
        StringHeaderCell("date")
    )
)

This way you define the header row so that refinery can locate it and further resolve the table location (start row, header row, end row).

Then we supply the WorkbookParser with the definition and the workbook, and invoke the parse method to get the list of parsed records back.

val workbook = WorkbookFactory.create(file)
val parsedRecords = WorkbookParser(definition, workbook).parse()

By default, all parsed records will be of type GenericRecord, which has a Map<String, Any> data holder inside. In our case we got 4 records back, and in the assertion block you can see what they look like.

Note that, apart from the data from the table, other useful metadata is extracted as well, such as the spreadsheet name and the row number.
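
For illustration, here is a minimal sketch of inspecting the parsed output. It assumes GenericRecord exposes its Map<String, Any> holder as a data property; the actual property name and metadata keys may differ in refinery.

// Sketch only: the "data" property name is an assumption.
parsedRecords
    .filterIsInstance<GenericRecord>()
    .forEach { record ->
        val team = record.data["team"]     // value extracted from the "team" column
        val points = record.data["points"] // value extracted from the "points" column
        println("team=$team, points=$points")
    }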

Optional columns

You will notice that the definition supports optional columns as well. These define columns that are not required to be present in order to identify a header row, but that you still want to use when they do exist, for example in your parser (more on that later).
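
As an illustration, an optional column could be declared alongside the required ones. Note that the optionalColumns parameter name is an assumption here and may differ in your version of refinery.

// Sketch only: the "optionalColumns" parameter name is an assumption.
TableParserDefinition(
    requiredColumns = setOf(
        StringHeaderCell("team"),
        StringHeaderCell("points")
    ),
    optionalColumns = setOf(
        // parsed when present, but not needed to locate the header row
        StringHeaderCell("notes")
    )
)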

Let's look at a slightly more complex example now, in the spreadsheet multiple tables. This spreadsheet contains 2 tables: one with the stats of the teams in group A, and another one with all matches played within the group as well as fixtures.

To extract the data from both of the tables we need to define 2 TableParserDefinitions and add both of them to the list of all table definitions:

SheetParserDefinition(   
    tableDefinitions = listOf(teamStatsTableDefinition, teamPlaysTableDefinition),
    ...
)

As you can see, we got 12 records back in parsedRecords. 4 of them relate to team stats rows, and another 8 relate to team plays.
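
For illustration, the two definitions referenced above could look roughly like this. The team stats headers match the earlier example, while the header names for the plays table are assumptions based on the description.

// Sketch only: the plays/fixtures header names are assumptions.
val teamStatsTableDefinition = TableParserDefinition(
    requiredColumns = setOf(
        StringHeaderCell("team"),
        StringHeaderCell("plays"),
        StringHeaderCell("wins"),
        StringHeaderCell("goal diff"),
        StringHeaderCell("points"),
        StringHeaderCell("date")
    )
)

val teamPlaysTableDefinition = TableParserDefinition(
    requiredColumns = setOf(
        StringHeaderCell("home team"), // assumed header name
        StringHeaderCell("away team"), // assumed header name
        StringHeaderCell("score"),     // assumed header name
        StringHeaderCell("date")       // assumed header name
    )
)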

Now that we know how to parse multiple tables, let's explore the data with anchors.

Here we have the same headers defined twice, relating to team stats in different groups (Group A and Group B). To handle that case we need to add an extra anchor parameter to the table definition:

SheetParserDefinition(
    tableDefinitions = listOf(
        // 1st table
        TableParserDefinition(
            requiredColumns = headerColumns,
            anchor = "Group A"
        ),
        // 2nd table
        TableParserDefinition(
            requiredColumns = headerColumns,
            anchor = "Group B"
        )
    ),
    ...
)

Note that once the data is parsed, the anchor information appears in each GenericRecord as extracted metadata.
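
For example, the records could then be split by anchor like this. Both the data property name and the "anchor" metadata key are assumptions and may differ in refinery.

// Sketch only: the "data" property and "anchor" key names are assumptions.
val groupARecords = parsedRecords
    .filterIsInstance<GenericRecord>()
    .filter { it.data["anchor"] == "Group A" }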

Another way to represent multiple groups (weird as it is) is to have the group as a divider row within a single table, and refinery allows you to extract this data as well!

Simply add the parameter hasDivider = true and voilà!

SheetParserDefinition(
    tableDefinitions = listOf(
        TableParserDefinition(
            requiredColumns = headerColumns,
            hasDivider = true
        )
    ),
    ...
)

Now the group information is extracted as the value of the divider key.
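
For example, the rows could be grouped by their divider like this. The "divider" key is taken from the description above, while the data property name is an assumption.

// Sketch only: the "data" property name is an assumption.
val rowsByGroup = parsedRecords
    .filterIsInstance<GenericRecord>()
    .groupBy { it.data["divider"] } // e.g. "Group A" -> rows under that divider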

Table with constrained dividers

The hasDivider option will treat any single-cell row within a table as a divider. Sometimes we wish to limit which of these we actually want to recognise. For this we can add the allowedDividers argument, which uses AbstractHeaderCell to do the matching. Of all possible dividers found, only those that match one of these allowedDividers will be used - anything else will be skipped.

SheetParserDefinition(
    tableDefinitions = listOf(
        TableParserDefinition(
            requiredColumns = headerColumns,
            hasDivider = true,
            allowedDividers = setOf(StringHeaderCell("Group A"), StringHeaderCell("Group B"))
        )
    ),
    ...
)

Merged cells are handled by refinery as well, so you will see the plays attribute in each GenericRecord being equal to 3.

Merged header cells are a bit more tricky, as for a single merged header cell you might have N columns with different data in the cells underneath. To handle that, refinery adds a postfix with the column number to the key. E.g. in our case score is located in columns 3 and 4, so the data is extracted into the score_3 and score_4 keys.
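
For example, both values under the merged score header could then be read like this (record is one parsed GenericRecord, and the data property name is again an assumption):

// Sketch only: the "data" property name is an assumption.
val firstScore = record.data["score_3"]  // value from column 3 under the merged "score" header
val secondScore = record.data["score_4"] // value from column 4 under the merged "score" header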