This repo evaluates a combined Document Intelligence (DI) + LLM technique for extracting entities from complex tax documents. We use schema2doc mapping based on the DI output of the processed document. DI returns its output as JSON or Markdown, including style information. We then prompt the LLM (GPT-4o) to process the DI output and return JSON that conforms to the defined schema.
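The flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function names `build_extraction_prompt` and `parse_llm_json`, the example schema, the sample document, and the prompt wording are all assumptions, and the actual calls to DI and the model are left out.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that models sometimes wrap JSON in


def build_extraction_prompt(di_markdown: str, schema: dict) -> str:
    """Combine the DI output and the target schema into one prompt (hypothetical wording)."""
    return (
        "Extract the entities defined by the JSON schema below from the document.\n"
        "Return only a JSON object that matches the schema; use null for missing values.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Document (Document Intelligence Markdown output):\n{di_markdown}"
    )


def parse_llm_json(response_text: str, schema: dict) -> dict:
    """Parse the model's reply and keep only the keys named in the schema."""
    text = response_text.strip()
    if text.startswith(FENCE):
        # Drop the surrounding fence and an optional "json" language tag.
        text = text.strip("`").removeprefix("json").strip()
    data = json.loads(text)
    # Keys missing from the reply become None; keys outside the schema are dropped.
    return {key: data.get(key) for key in schema}


# Made-up schema and document, for illustration only.
schema = {"taxpayer_name": "string", "tax_year": "string", "total_income": "number"}
di_markdown = "Name: Jane Doe\n\nTax year: 2023\n\nTotal income: 85000"
prompt = build_extraction_prompt(di_markdown, schema)

# Stand-in for the model's reply; a real run would send `prompt` to the deployed model.
reply = '{"taxpayer_name": "Jane Doe", "tax_year": "2023", "total_income": 85000, "extra": 1}'
result = parse_llm_json(reply, schema)
```

Filtering the reply against the schema keys keeps the output shape stable even when the model adds fields that were not requested.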
1. Azure account: Create an Azure account by [signing up here](https://azure.microsoft.com/).
2. Azure CLI: Install the Azure CLI from [here](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli).
Open your terminal and log in to your Azure account:
az login
Follow the instructions to complete the authentication process. If you are using a specific subscription, set it as the default:
az account set --subscription "your-subscription-id"
az group create --name <resource_group_name> --location <region>
- In your Azure portal, click on “Create a resource”.
- Search for “OpenAI” and select it.
- Click on “Create” and fill in the necessary details such as name, subscription, resource group, etc.
- Click on “Review + Create” and then “Create” to create the resource.
- Once the deployment is complete, go to the resource page.
- Under “Keys and Endpoint”, you can find your key and endpoint. Save these for later use.
- In your Azure portal, click on “Create a resource”.
- Search for “Document Intelligence” and select it.
- Click on “Create” and fill in the necessary details.
- Click on “Review + Create” and then “Create” to create the resource.
- Once the deployment is complete, go to the resource page.
- Under “Keys and Endpoint”, you can find your key and endpoint. Save these for later use.
- In your Azure portal, click on “Create a resource”.
- Search for “Azure Search” and select it.
- Click on “Create” and fill in the necessary details.
- Click on “Review + Create” and then “Create” to create the resource.
- Once the deployment is complete, go to the resource page.
- Under “Keys and Endpoint”, you can find your key and endpoint. Save these for later use.
- In your Azure portal, click on Azure Active Directory (now named Microsoft Entra ID) in the left-hand menu.
- Your Tenant ID is listed as Directory ID on the default page.
- In the Azure portal, click on App Registrations in the left-hand menu under Azure Active Directory.
- Click on New Registration at the top.
- Fill in the details such as name, supported account types, and redirect URI (if necessary), then click Register.
- After the app is registered, the Application (client) ID is displayed on the app page. This is your Client ID.
- To get the Client Secret, click on Certificates & secrets in the left-hand menu of the app page.
- Click on New client secret, add a description, select an expiry period, and click Add.
- After the client secret is created, copy the Value. This is your Client Secret.
- Open your code editor or terminal.
- Navigate to the root directory of your project.
- Create a new file named `.env`.
- Open the `.env` file.
- Add your environment variables in the format `KEY=VALUE`, one per line. For example:

      AZURE_SUBSCRIPTION_ID=<your-subscription-id>
      AZURE_TENANT_ID=<your-tenant-id>
      AZURE_CLIENT_ID=<your-client-id>
      AZURE_CLIENT_SECRET=<your-client-secret>
      AZURE_OPENAI_ENDPOINT=<YOUR_RESOURCE_ENDPOINT_HERE>
      AZURE_OPENAI_API_KEY=<YOUR_RESOURCE_KEY_HERE>
      DOC_INTELLIGENCE_ENDPOINT=<YOUR_RESOURCE_ENDPOINT_HERE>
      DOC_INTELLIGENCE_KEY=<YOUR_RESOURCE_KEY_HERE>
      VECTOR_SEARCH_ENDPOINT=<YOUR_RESOURCE_ENDPOINT_HERE>
      VECTOR_SEARCH_KEY=<YOUR_RESOURCE_KEY_HERE>
      DEPLOYMENT_NAME=<YOUR_MODEL_DEPLOYMENT_NAME_HERE>
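
Once the file exists, the variables can be read at startup. The sketch below parses `KEY=VALUE` lines with the standard library only and reports any names from the example above that are missing. The helper names `parse_env` and `missing_keys` are hypothetical, and the project may instead rely on a package such as `python-dotenv`.

```python
from pathlib import Path

# Variable names taken from the example above.
REQUIRED_KEYS = [
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "DOC_INTELLIGENCE_ENDPOINT",
    "DOC_INTELLIGENCE_KEY",
    "VECTOR_SEARCH_ENDPOINT",
    "VECTOR_SEARCH_KEY",
    "DEPLOYMENT_NAME",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines, # comments, and lines without '='."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env(env_path.read_text()) if env_path.exists() else {}
    print("Missing variables:", missing_keys(values))
```

Running the check before opening the notebook surfaces a missing key or endpoint early, rather than as an authentication error partway through a run.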
We recommend using Python virtual environments for local branch development.
The main advantage of virtual environments is that you can create a separate workspace for each branch, so that you can safely install, remove, or upgrade a library without affecting other environments.
venv
docs: https://docs.python.org/3/library/venv.html
Create a new conda environment with Python 3.11:
conda create -n myenv python=3.11
Then, activate the environment
conda activate myenv
pip install -r requirements.txt
Refer to the notebook to start the Q&A process on your baseline documents.