
Commit

add pre-download models tool
jparkerweb committed Jul 12, 2024
1 parent 07c603f commit a7cec77
Showing 5 changed files with 140 additions and 22 deletions.
8 changes: 8 additions & 0 deletions README.md
@@ -225,3 +225,11 @@ The behavior of the `chunkit` function can be finely tuned using several optiona
- **Type**: Boolean
- **Default**: `true`
- **Description**: Indicates whether to use a quantized version of the specified model. Quantized models generally offer faster performance with a slight trade-off in accuracy, which can be beneficial when processing very large datasets.

---

## 💾 Pre-Downloading Models

Fill out the `tools/download-models-list.json` file with a list of models you want pre-downloaded, noting for each whether it should be quantized (see the Curated ONNX Embedding Models section above for models to try).

[![Pre-Downloading Models walkthrough video](https://img.youtube.com/vi/JItZqsL3umY/0.jpg)](https://www.youtube.com/watch?v=JItZqsL3umY)
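Each entry in the list pairs a `modelName` with a `quantized` flag. As a quick sanity check before running the download script, the expected shape can be verified with a small sketch (the `validateModelList` helper below is illustrative, not part of the package):

```javascript
// Illustrative check that a model list matches the expected shape:
// an array of { modelName: string, quantized: boolean } entries.
const entries = [
    { modelName: "Xenova/all-MiniLM-L6-v2", quantized: true },
    { modelName: "BAAI/bge-small-en-v1.5", quantized: false },
];

function validateModelList(list) {
    return Array.isArray(list) && list.every(
        (e) => typeof e.modelName === "string" && typeof e.quantized === "boolean"
    );
}

console.log(validateModelList(entries)); // true
console.log(validateModelList([{ modelName: "Xenova/all-MiniLM-L6-v2" }])); // false
```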
97 changes: 77 additions & 20 deletions package-lock.json

Some generated files are not rendered by default.

6 changes: 4 additions & 2 deletions package.json
@@ -1,6 +1,6 @@
{
"name": "semantic-chunking",
"version": "1.2.1",
"version": "1.3.0",
"description": "Semantically create chunks from large texts. Useful for workflows involving large language models (LLMs).",
"repository": {
"type": "git",
@@ -23,11 +23,13 @@
"license": "ISC",
"scripts": {
"clean-models": "find ./models -type f ! -name '*.url' -delete",
"clean-models-win": "powershell -Command \"Get-ChildItem -Path ./models -Recurse | Where-Object { !$_.PSIsContainer -and $_.Extension -ne '.url' } | Remove-Item\""
"clean-models-win": "powershell -Command \"Get-ChildItem -Path ./models -Recurse | Where-Object { !$_.PSIsContainer -and $_.Extension -ne '.url' } | Remove-Item\"",
"download-models": "node ./tools/download-models.js"
},
"dependencies": {
"@stdlib/nlp-sentencize": "^0.2.1",
"@xenova/transformers": "^2.17.2",
"cli-progress": "^3.12.0",
"fs": "^0.0.1-security"
}
}
10 changes: 10 additions & 0 deletions tools/download-models-list.json
@@ -0,0 +1,10 @@
[
{
"modelName": "Xenova/all-MiniLM-L6-v2",
"quantized": true
},
{
"modelName": "BAAI/bge-small-en-v1.5",
"quantized": false
}
]
41 changes: 41 additions & 0 deletions tools/download-models.js
@@ -0,0 +1,41 @@
import { pipeline, env as transformerCache } from "@xenova/transformers";
import cliProgress from "cli-progress";
import fs from "fs/promises";

const data = await fs.readFile("./tools/download-models-list.json", "utf8");
const downloadModelsList = JSON.parse(data);

// cache downloaded models in the local "models" directory
transformerCache.cacheDir = "models";

console.log(`Downloading embedding models…`);

const bar = new cliProgress.SingleBar({
    clearOnComplete: false,
    hideCursor: true,
    format: "[{bar}] {value}% | {model}",
});

let started = false;

for (const model of downloadModelsList) {
    console.log(`model: ${model.modelName}, quantized: ${model.quantized}`);
    bar.start(100, 0, { model: model.modelName });

    await pipeline("feature-extraction", model.modelName, {
        quantized: model.quantized,
        progress_callback: (data) => {
            // ignore events until the actual file download begins
            started = started || data.status === "download";
            if (!started) return;

            if (data.progress) bar.update(Math.floor(data.progress));
        },
    });
    bar.update(100);
    bar.stop();
}

console.log("Success!");
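The `started` flag above keeps the progress bar quiet until the first real download event arrives from the `progress_callback`. That gating logic can be sketched in isolation with a stubbed callback (`makeProgressGate` is an illustrative helper, not part of the package):

```javascript
// Stand-alone sketch of the progress gating used in download-models.js:
// ignore callback events until a "download" status arrives, then forward
// whole-number progress values.
function makeProgressGate(onProgress) {
    let started = false;
    return (data) => {
        started = started || data.status === "download";
        if (!started) return;
        if (data.progress) onProgress(Math.floor(data.progress));
    };
}

const seen = [];
const gate = makeProgressGate((p) => seen.push(p));
gate({ status: "initiate" });                 // ignored: download not started yet
gate({ status: "download" });                 // arms the gate (no progress value yet)
gate({ status: "progress", progress: 42.7 }); // forwarded as 42
console.log(seen); // [42]
```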
