
Welcome to docudigger 👋

License: MIT

Document scraper for getting invoices automagically as pdf (useful for taxes or DMS)

Prerequisites

  • npm >=9.1.2
  • node >=18.12.1

Configuration

All settings can be changed via CLI arguments or environment variables (even when using Docker).

| Setting | Description | Default value |
| --- | --- | --- |
| AMAZON_USERNAME | Your Amazon username | null |
| AMAZON_PASSWORD | Your Amazon password | null |
| AMAZON_TLD | Amazon top-level domain | de |
| AMAZON_YEAR_FILTER | Only extracts invoices from this year (e.g. 2023) | 2023 |
| AMAZON_PAGE_FILTER | Only extracts invoices from this page (e.g. 2) | null |
| ONLY_NEW | Tracks already scraped documents and starts a new run at the last scraped one | true |
| FILE_DESTINATION_FOLDER | Destination path for all scraped documents | ./documents/ |
| FILE_FALLBACK_EXTENSION | Fallback extension when no extension can be determined | .pdf |
| DEBUG | Debug flag (sets the log level to DEBUG) | false |
| SUBFOLDER_FOR_PAGES | Creates subfolders for every scraped page/plugin | false |
| LOG_PATH | Sets the log path | ./logs/ |
| LOG_LEVEL | Log level (see https://github.com/winstonjs/winston#logging-levels) | info |
| RECURRING | Flag for executing the script periodically. Needs RECURRING_PATTERN to be set. Defaults to true when using the Docker container | false |
| RECURRING_PATTERN | Cron pattern for periodic execution. Needs RECURRING set to true | */30 * * * * |
| TZ | Timezone used for Docker environments | Europe/Berlin |
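For reference, a minimal `.env` file built from the variables above might look like this (credentials and paths are placeholders, not real values):

```shell
# Sample .env for docudigger (placeholder values - adjust to your account)
AMAZON_USERNAME=you@example.com
AMAZON_PASSWORD=your-password
AMAZON_TLD=de
AMAZON_YEAR_FILTER=2023
ONLY_NEW=true
FILE_DESTINATION_FOLDER=./documents/
LOG_LEVEL=info
```

Any variable left out falls back to the default from the table above.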

Install

⚠️ Attention: There is no need to install this locally. Just use npx

Usage

🔨 Make sure you have a .env file (with the variables from above) in the working directory, or use the appropriate CLI arguments.

🚑 If you want to use a .env file, make sure you use env-cmd (https://www.npmjs.com/package/env-cmd)
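A sketch of running docudigger through env-cmd so the variables from a `.env` file in the current directory are picked up (assumes both packages are installed locally so npx can resolve their binaries):

```shell
# Install once into the current project
npm install env-cmd docudigger
# env-cmd loads ./.env, then launches docudigger with those variables set
npx env-cmd docudigger scrape all
```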

$ npx docudigger COMMAND
running command...

$ npx docudigger (--version)
@disane-dev/docudigger/2.0.2 linux-x64 node-v18.16.1

$ npx docudigger --help [COMMAND]
USAGE
  $ docudigger COMMAND

docudigger scrape all

Scrapes all websites periodically (default for docker environment)

USAGE
  $ npx docudigger scrape all [--json] [--logLevel trace|debug|info|warn|error] [-d] [-l <value>] [-c <value> -r]

FLAGS
  -c, --recurringCron=<value>  [default: * * * * *] Cron pattern to execute periodically
  -d, --debug
  -l, --logPath=<value>        [default: ./logs/] Log path
  -r, --recurring
  --logLevel=<option>          [default: info] Specify level for logging.
                               <options: trace|debug|info|warn|error>

GLOBAL FLAGS
  --json  Format output as json.

DESCRIPTION
  Scrapes all websites periodically

EXAMPLES
  $ docudigger scrape all

docudigger scrape amazon

Used to get invoices from amazon

USAGE
  $ npx docudigger scrape amazon -u <value> -p <value> [--json] [--logLevel trace|debug|info|warn|error] [-d] [-l
    <value>] [-c <value> -r] [--fileDestinationFolder <value>] [--fileFallbackExentension <value>] [-t <value>]
    [--yearFilter <value>] [--pageFilter <value>] [--onlyNew]

FLAGS
  -c, --recurringCron=<value>        [default: * * * * *] Cron pattern to execute periodically
  -d, --debug
  -l, --logPath=<value>              [default: ./logs/] Log path
  -p, --password=<value>             (required) Password
  -r, --recurring
  -t, --tld=<value>                  [default: de] Amazon top level domain
  -u, --username=<value>             (required) Username
  --fileDestinationFolder=<value>    [default: ./data/] Destination folder for scraped documents
  --fileFallbackExentension=<value>  [default: .pdf] Fallback extension when no extension can be determined
  --logLevel=<option>                [default: info] Specify level for logging.
                                     <options: trace|debug|info|warn|error>
  --onlyNew                          Gets only new invoices
  --pageFilter=<value>               Filters a page
  --yearFilter=<value>               Filters a year

GLOBAL FLAGS
  --json  Format output as json.

DESCRIPTION
  Used to get invoices from amazon

  Scrapes amazon invoices

EXAMPLES
  $ docudigger scrape amazon
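For example, a one-off run pulling the 2023 invoices from amazon.de could look like this (the credentials are placeholders; flags as documented above):

```shell
npx docudigger scrape amazon \
  -u 'you@example.com' \
  -p 'your-password' \
  -t de \
  --yearFilter 2023 \
  --onlyNew
```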

Docker

docker run \
  -e AMAZON_USERNAME='[YOUR MAIL]' \
  -e AMAZON_PASSWORD='[YOUR PW]' \
  -e AMAZON_TLD='de' \
  -e AMAZON_YEAR_FILTER='2020' \
  -e AMAZON_PAGE_FILTER='1' \
  -e LOG_LEVEL='info' \
  -v "C:/temp/docudigger/:/home/node/docudigger" \
  ghcr.io/disane87/docudigger
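The same container can also be run via Docker Compose; a minimal sketch using the image and variables from the command above (the host volume path is a placeholder):

```yaml
services:
  docudigger:
    image: ghcr.io/disane87/docudigger
    environment:
      AMAZON_USERNAME: "[YOUR MAIL]"
      AMAZON_PASSWORD: "[YOUR PW]"
      AMAZON_TLD: "de"
      AMAZON_YEAR_FILTER: "2020"
      LOG_LEVEL: "info"
      TZ: "Europe/Berlin"
    volumes:
      - ./docudigger:/home/node/docudigger
```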

Dev-Time 🪲

NPM

npm install
# Adjust the created .env to your needs
npm run start

Author

👤 Marco Franke

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check issues page. You can also take a look at the contributing guide.

Show your support

Give a ⭐️ if this project helped you!


This README was generated with ❤️ by readme-md-generator
