Feature/add office files support #52

demig00d · 2024-01-19T15:32:12Z

Added support for docx and xlsx files. This PR addresses #10.

Documents are currently parsed as plain text, which loses hyperlinks. To solve this, we could parse these files to markdown instead. @scambier, what do you think?

Parsing to markdown could be achieved by parsing docx to html (the mammoth lib I've added supports this) and then converting to markdown (this requires another external dependency, but could be useful if we plan to support html).

As for xlsx files, we could convert them to a csv format (the sheetjs lib I'm using gets the plain text from its csv function anyway) and then convert them to md.
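The last step of the xlsx idea (csv to md) can be sketched without any extra dependency. `csvToMarkdownTable` is a hypothetical helper, not part of this PR, and it assumes simple CSV with no quoted fields, embedded commas, or embedded newlines; real spreadsheet output would need a proper CSV parser.

```typescript
// Hypothetical helper: converts simple CSV text into a Markdown table.
// Assumption: no quoted fields, embedded commas, or embedded newlines.
function csvToMarkdownTable(csv: string): string {
  const rows = csv
    .split(/\r?\n/)
    .filter(line => line.trim() !== '')
    .map(line => line.split(',').map(cell => cell.trim()))
  if (rows.length === 0) return ''

  const width = Math.max(...rows.map(row => row.length))

  // Pad short rows and escape pipes so a cell cannot break the table.
  const toLine = (row: string[]): string => {
    const cells = Array.from({ length: width }, (_, i) =>
      (row[i] ?? '').replace(/\|/g, '\\|')
    )
    return `| ${cells.join(' | ')} |`
  }

  const separator = `| ${Array.from({ length: width }, () => '---').join(' | ')} |`
  const lines = rows.map(toLine)
  lines.splice(1, 0, separator) // the first row is treated as the header
  return lines.join('\n')
}
```

Treating the first row as the header is a simplification; a sheet without a header row would still render, just with its first data row in bold.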

demig00d · 2024-01-19T15:43:53Z

lib/rollup.config.js

The onwarn function was added due to an issue encountered during the execution of pnpm run build. The problem arose from a Rollup warning stating that something had been rewritten to undefined.

The rollup-plugin-polyfill-node dependency was added to fix the "Missing global variable names" warning.
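For readers who have not used `onwarn`: Rollup calls it with each warning and a default handler. Below is a minimal sketch of such a filter, not the handler from this PR. It assumes the warning in question carries the code `THIS_IS_UNDEFINED`, and the `Warning` and `WarnHandler` types are local stand-ins for Rollup's own.

```typescript
// Local stand-ins so the sketch is self-contained; Rollup ships real types.
interface Warning {
  code?: string
  message: string
}
type WarnHandler = (warning: Warning) => void

// Assumption: the warning seen during `pnpm run build` has this code.
const IGNORED_CODES = new Set(['THIS_IS_UNDEFINED'])

function onwarn(warning: Warning, warn: WarnHandler): void {
  // Drop only the known, harmless warning.
  if (warning.code !== undefined && IGNORED_CODES.has(warning.code)) return
  // Everything else still reaches the default handler.
  warn(warning)
}
```

Filtering by code rather than silencing everything keeps new, unrelated warnings visible in the build output.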

scambier · 2024-01-19T15:45:34Z

Thanks for that PR :)

> To solve this problem we could consider parsing these files to a markdown format instead

If that's included in the lib, I guess there's no reason to not take advantage of it 👍

demig00d · 2024-01-19T15:46:31Z

lib/src/office/office-manager.ts

It's just a renamed copy of pdf/pdf-manager.ts

demig00d · 2024-01-19T15:54:15Z

lib/src/globals.ts

@@ -8,16 +8,18 @@ const cpuCount = Platform.isMobileApp ? 1 : require('os').cpus().length

 const ocrBackgroundProcesses = cpuCount == 2 ? 1 : 2

+const officeBackgroundProcesses = 1


How should the number of processes be counted? I have temporarily set a dummy value here.

scambier

You can leave it at 1.

It's a configurable value because initially I tried to balance the number of background workers between the OCR, PDF, and total number of available threads. The goal was to maximize the use of resources to extract files as quickly as possible.

It should probably be refactored, but in the meantime, 1 will be good enough.
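As a rough illustration of the rule in the hunk above, the per-extractor counts can be expressed as one pure function. `planWorkers` and `WorkerPlan` are hypothetical names; only the OCR rule and the fixed office value come from the diff.

```typescript
// Hypothetical shape; the real module exports separate constants.
interface WorkerPlan {
  ocr: number
  office: number
}

function planWorkers(isMobileApp: boolean, coreCount: number): WorkerPlan {
  // Mirrors the hunk: mobile is treated as a single-core device.
  const cpuCount = isMobileApp ? 1 : coreCount
  return {
    ocr: cpuCount === 2 ? 1 : 2, // same rule as ocrBackgroundProcesses
    office: 1, // fixed value agreed on in this thread
  }
}
```

Keeping the rule in one function would make a later refactor (for example, capping the total by `cpuCount`) a single-place change.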

demig00d · 2024-01-19T16:40:42Z

> Thanks for that PR :)
>
> > To solve this problem we could consider parsing these files to a markdown format instead
>
> If that's included in the lib, I guess there's no reason to not take advantage of it 👍

Unfortunately this is not the case.

I suggested going through intermediate formats from which we can get markdown (with additional dependencies). The thing is, all the JS libraries that convert office files directly to markdown do the same under the hood, at least the ones I could find.

scambier · 2024-01-19T16:56:11Z

If you can manage to keep the URLs that's fine, but honestly I don't even know if they are handled correctly when extracting the text from PDFs either 🤷‍♂️

demig00d · 2024-01-19T17:22:33Z

I'll try working with markdown in another PR then, if you don't mind.

scambier

I'll take a closer look at the rollup stuff, but other than that it looks good. Thanks again :)

scambier · 2024-01-19T17:42:48Z

lib/src/office/office-worker.ts

+onmessage = async evt => {
+  try {
+    let text = ''
+    if (evt.data.extension == 'docx') {


Please use triple equality where applicable (I think I missed a few myself, but it's cleaner with === :p)
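A small illustration of why this matters for untyped message data: loose equality coerces its operands, so a malformed payload can still match. The payload below is made up for the example.

```typescript
// Worker message data is untyped, so `extension` behaves like `any` here.
// Made-up malformed payload: an array instead of a string.
const data: any = { extension: ['docx'] }

// Loose equality converts the array to the string 'docx' before comparing.
const looseMatch: boolean = data.extension == 'docx'

// Strict equality compares without conversion, so the array does not match.
const strictMatch: boolean = data.extension === 'docx'
```

With `===`, the worker only takes the docx branch when `extension` really is the string `'docx'`.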

scambier · 2024-01-19T17:50:31Z

lib/src/index.ts

@@ -39,13 +42,19 @@ function isFileImage(path: string): boolean {
  )
 }

+function isFileOffice(path: string): boolean {
+  return (
+    path.endsWith('.docx') || path.endsWith('xlsx')


It should be .xlsx with a dot, to be consistent
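Beyond consistency, the missing dot changes behavior: any path that merely ends in the letters `xlsx` would pass. The sketch below contrasts the two checks. The hunk above is truncated, so the function bodies are an assumption limited to the two extensions shown, and `isFileOfficeWithoutDot` is a hypothetical name for the original version.

```typescript
// Original check from the hunk: the second suffix has no dot.
function isFileOfficeWithoutDot(path: string): boolean {
  return path.endsWith('.docx') || path.endsWith('xlsx')
}

// Suggested check: both suffixes include the dot.
function isFileOffice(path: string): boolean {
  return path.endsWith('.docx') || path.endsWith('.xlsx')
}

// A file with no extension whose name happens to end in "xlsx".
const oddName = 'notes/about-xlsx'
const matchedWithoutDot = isFileOfficeWithoutDot(oddName) // true
const matchedWithDot = isFileOffice(oddName) // false
```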

demig00d added 2 commits January 19, 2024 15:31

Add docx support

335b3f9

Add xlsx support

b89ce29

demig00d marked this pull request as ready for review January 19, 2024 17:21

Update README

c02b68a

scambier reviewed Jan 19, 2024

demig00d added 2 commits January 19, 2024 21:03

Use triple equality

013daf9

Add a missing dot for the file extension

1d2da35

scambier merged commit 6de35ee into scambier:master Jan 20, 2024

demig00d mentioned this pull request Jan 20, 2024

Index office documents scambier/obsidian-omnisearch#340

Merged
