Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/add office files support #52

Merged
merged 5 commits into from
Jan 20, 2024

Conversation

demig00d
Copy link
Contributor

@demig00d demig00d commented Jan 19, 2024

Added support for docx and xlsx files. This PR addresses #10.

Documents are now parsed as plain text, but such an approach results in a loss of hyperlinks. To solve this problem we could consider parsing these files to a markdown format instead, @scambier, what do you think?

Parsing to markdown could be achieved by parsing docx to html (the mammoth lib I've added supports this) and then converting to markdown (this requires another external dependency, but could be useful if we plan to support html).

As for xlsx files, we could convert them to a csv format (the sheetjs lib I'm using gets the plain text from its csv function anyway) and then convert them to md.

Copy link
Contributor Author

@demig00d demig00d Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The onwarn function was added due to an issue encountered during the execution of pnpm run build. The problem arose from a Rollup warning stating that something had been rewritten to undefined.
  • The rollup-plugin-polyfill-node dependency was added to fix Missing global variable names warning

@scambier
Copy link
Owner

Thanks for that PR :)

To solve this problem we could consider parsing these files to a markdown format instead

If that's included in the lib, I guess there's no reason to not take advantage of it 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a renamed copy of pdf/pdf-manager.ts

@@ -8,16 +8,18 @@ const cpuCount = Platform.isMobileApp ? 1 : require('os').cpus().length

const ocrBackgroundProcesses = cpuCount == 2 ? 1 : 2

const officeBackgroundProcesses = 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should the number of processes be counted? I have temporarily set a dummy value here.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can leave it at 1.

It's a configurable value because initially I tried to balance the number of background workers between the OCR, PDF, and total number of available threads. The goal was to maximize the use of resources to extract files as quickly as possible.

It should probably be refactored, but in the meantime, 1 will be good enough.

@demig00d
Copy link
Contributor Author

Thanks for that PR :)

To solve this problem we could consider parsing these files to a markdown format instead

If that's included in the lib, I guess there's no reason to not take advantage of it 👍

Unfortunately this is not the case.

I suggested using intermediate formats from which we can get markdown (with additional dependencies), the thing is that all the js libraries that convert office files directly to markdown do the same under the hood. At least the ones I could find.

@scambier
Copy link
Owner

If you can manage to keep the URLs that's fine, but honestly I don't even know if they are handled correctly when extracting the text from PDFs either 🤷‍♂️

@demig00d demig00d marked this pull request as ready for review January 19, 2024 17:21
@demig00d
Copy link
Contributor Author

demig00d commented Jan 19, 2024

I'll try working with markdown in another PR then, if you don't mind.

Copy link
Owner

@scambier scambier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a closer look at the rollup stuff, but other than that it looks good. Thanks again :)

onmessage = async evt => {
try {
let text = ''
if (evt.data.extension == 'docx') {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use triple equality where applicable (I think I missed a few myself, but it's cleaner with === :p)

lib/src/index.ts Outdated
@@ -39,13 +42,19 @@ function isFileImage(path: string): boolean {
)
}

function isFileOffice(path: string): boolean {
return (
path.endsWith('.docx') || path.endsWith('xlsx')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be .xlsx with a dot, to be consistent

@scambier scambier merged commit 6de35ee into scambier:master Jan 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants