From bc99ec81288e222de8837fbf1e83975d53d7e456 Mon Sep 17 00:00:00 2001 From: Paul Van Eck Date: Fri, 24 Mar 2023 15:49:15 -0700 Subject: [PATCH] [Monitor][Query] Add sample notebook for large queries (#28148) A sample notebook was added to demonstrate how a user can overcome Log Analytics service limits when trying to query large amounts of data. Signed-off-by: Paul Van Eck --- .../azure-monitor-query/samples/README.md | 5 + .../notebooks/sample_large_query.ipynb | 592 ++++++++++++++++++ 2 files changed, 597 insertions(+) create mode 100644 sdk/monitor/azure-monitor-query/samples/notebooks/sample_large_query.ipynb diff --git a/sdk/monitor/azure-monitor-query/samples/README.md b/sdk/monitor/azure-monitor-query/samples/README.md index 9382c9609375..7256329759ec 100644 --- a/sdk/monitor/azure-monitor-query/samples/README.md +++ b/sdk/monitor/azure-monitor-query/samples/README.md @@ -23,6 +23,11 @@ The following code samples show common scenarios with the Azure Monitor Query cl - [Send multiple queries with LogsQueryClient](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/monitor/azure-monitor-query/samples/sample_batch_query.py) - [Send a single query with LogsQueryClient using server timeout](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/monitor/azure-monitor-query/samples/sample_server_timeout.py) +#### Notebook samples + +- [Split a large query into multiple smaller queries to avoid hitting service limits](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/monitor/azure-monitor-query/samples/notebooks/sample_large_query.ipynb) + + ### Metrics query samples - [Send a query using MetricsQueryClient](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/monitor/azure-monitor-query/samples/sample_metrics_query.py) ([async sample](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/monitor/azure-monitor-query/samples/async_samples/sample_metrics_query_async.py)) diff --git a/sdk/monitor/azure-monitor-query/samples/notebooks/sample_large_query.ipynb b/sdk/monitor/azure-monitor-query/samples/notebooks/sample_large_query.ipynb new file mode 100644 index 000000000000..34ba311ba1e8 --- /dev/null +++ b/sdk/monitor/azure-monitor-query/samples/notebooks/sample_large_query.ipynb @@ -0,0 +1,592 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Query with large result sets\n", + "\n", + "This sample notebook demonstrates how to query large amounts of data using the Azure Monitor Query client library.\n", + "\n", + "Due to Log Analytics [service limits](https://learn.microsoft.com/azure/azure-monitor/service-limits#la-query-api), sometimes it may not be possible to retrieve all the expected data in a single query. For example, the number of rows returned or the maximum size of the data returned may exceed the stated limits. One approach for overcoming these limits is to split the queries into multiple smaller queries using different time ranges.\n", + "\n", + "In this notebook, you will learn how your data in a Log Analytics workspace can first be queried to determine the time ranges that can be used to split the data retrieval into multiple smaller queries without exceeding the service limits. Then, you will asynchronously execute the smaller queries and combine the results into a single pandas DataFrame which can be used for further analysis. 
Afterwards, this notebook also shows how to export the data to an [Azure Data Lake Storage (ADLS)](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-introduction) account.\n",
+    "\n",
+    "**Disclaimer**: This approach of splitting data retrieval into multiple smaller queries works well when dealing with a few GB of data or a few million records per hour. For larger data sets, [exporting](https://learn.microsoft.com/azure/azure-monitor/logs/logs-data-export) is recommended."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Getting started\n",
+    "\n",
+    "For this notebook, it is assumed that you have an existing Azure subscription and an active [Log Analytics workspace](https://learn.microsoft.com/azure/azure-monitor/logs/log-analytics-workspace-overview) that contains at least one table populated with data.\n",
+    "\n",
+    "Start by installing the Azure Monitor Query and Azure Identity client libraries along with the `pandas` data analysis library."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "\n",
+    "!{sys.executable} -m pip install --upgrade azure-monitor-query azure-identity pandas"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Setup\n",
+    "\n",
+    "An authenticated client is required to query data from Log Analytics. The following code shows how to create a `LogsQueryClient` using the `DefaultAzureCredential`. Note that an async credential and client are used."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from azure.identity.aio import DefaultAzureCredential\n",
+    "from azure.monitor.query.aio import LogsQueryClient\n",
+    "\n",
+    "credential = DefaultAzureCredential()\n",
+    "client = LogsQueryClient(credential)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Set Log Analytics workspace ID\n",
+    "\n",
+    "Set the `LOGS_WORKSPACE_ID` variable below to the ID of your Log Analytics workspace."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "LOGS_WORKSPACE_ID = \"\""
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Define helper functions\n",
+    "\n",
+    "In order to overcome the service limits, the strategy is to query the data in smaller chunks based on some time column (e.g. `TimeGenerated`). The following helper functions support this by first querying your data to find suitable start and end times for the batch queries.\n",
+    "\n",
+    "- The `get_batch_endpoints_by_row_count` function will return a list of times that can be used in the query time spans while ensuring that the number of rows returned will be less than the specified row limit.\n",
+    "- The `get_batch_endpoints_by_byte_size` function will return a list of times that can be used in the query time spans while ensuring that the size of the data returned is less than the specified byte size limit.\n",
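+    "\n",
+    "As a rough illustration (with hypothetical numbers, not taken from a real workspace): if `max_rows_per_query` is 100,000 and the query matches about 250,000 rows in the requested time range, `get_batch_endpoints_by_row_count` returns the end time plus three batch boundary timestamps, which the later steps turn into three sub-queries of at most approximately 100,000 rows each."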
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import datetime, timedelta\n", + "\n", + "import pandas as pd\n", + "\n", + "from azure.core.exceptions import HttpResponseError\n", + "from azure.monitor.query import LogsQueryStatus\n", + "\n", + "\n", + "async def get_batch_endpoints_by_row_count(\n", + " query: str,\n", + " end_time: datetime, \n", + " days_back: int, \n", + " max_rows_per_query: int = int(1e5),\n", + " time_col: str = \"TimeGenerated\",\n", + "):\n", + " \"\"\"\n", + " Determine the timestamp endpoints for each chunked query\n", + " such that number of rows returned by each query is (approximately) `max_rows_per_query`\n", + " \"\"\"\n", + "\n", + " # This query will assign a batch number to each row depending on the maximum number of rows per query.\n", + " # Then the earliest timestamp for each batch number is used for each query endpoint.\n", + " find_batch_endpoints_query = f\"\"\"\n", + " {query}\n", + " | sort by {time_col} desc\n", + " | extend batch_num = row_cumsum(1) / {max_rows_per_query}\n", + " | summarize endpoint=min({time_col}) by batch_num\n", + " | sort by batch_num asc\n", + " | project endpoint\n", + " \"\"\"\n", + " \n", + " start_time = end_time - timedelta(days=days_back)\n", + " try:\n", + " response = await client.query_workspace(\n", + " workspace_id=LOGS_WORKSPACE_ID,\n", + " query=find_batch_endpoints_query,\n", + " timespan=(start_time, end_time),\n", + " )\n", + " except HttpResponseError as e:\n", + " print(\"Error batching endpoints by row count\")\n", + " raise e\n", + "\n", + " batch_endpoints = [end_time]\n", + " batch_endpoints += [row[0] for row in response.tables[0].rows]\n", + " return batch_endpoints\n", + "\n", + "\n", + "async def get_batch_endpoints_by_byte_size(\n", + " query: str,\n", + " end_time: datetime, \n", + " days_back: int,\n", + " max_bytes_per_query: int = 100 * 1024 * 1024, # 100 MiB\n", + " time_col: str = \"TimeGenerated\",\n", + "):\n", + " \"\"\"\n", + " Determine the timestamp endpoints for each chunked query such that\n", + " the size of the data returned is less than `max_bytes_per_query`.\n", + " \"\"\"\n", + " \n", + " # This query will assign a batch number to each row depending on the estimated data size.\n", + " # Then the earliest timestamp for each batch number is used for each query endpoint.\n", + " find_batch_endpoints_query = f\"\"\"\n", + " {query}\n", + " | sort by {time_col} desc\n", + " | extend batch_num = row_cumsum(estimate_data_size(*)) / {max_bytes_per_query}\n", + " | summarize endpoint=min({time_col}) by batch_num\n", + " | sort by batch_num asc\n", + " | project endpoint\n", + " \"\"\"\n", + "\n", + " start_time = end_time - timedelta(days=days_back)\n", + " try:\n", + " response = await client.query_workspace(\n", + " workspace_id=LOGS_WORKSPACE_ID,\n", + " query=find_batch_endpoints_query,\n", + " timespan=(start_time, end_time)\n", + " )\n", + " except HttpResponseError as e:\n", + " print(\"Error batching endpoints by byte size\")\n", + " raise e\n", + "\n", + " batch_endpoints = [end_time]\n", + " batch_endpoints += [row[0] for row in response.tables[0].rows]\n", + " return batch_endpoints" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, define a function that will asynchronously execute a given query over a time range specified by a given start time and end time. This function will call the `query_workspace` method of the `LogsQueryClient`. 
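Each chunked call passes its start and end times as the `timespan` tuple, and an optional `x-ms-correlation-request-id` header is attached so that the related requests can be correlated when troubleshooting.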
The Azure Monitor Query library will automatically handle retries in case of connection-related errors or server errors (i.e. 500, 503, and 504 status codes). Check [here](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/core/azure-core#configurations) for more information on configuring retries." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "async def execute_query(\n", + " query: str, \n", + " start_time: datetime, \n", + " end_time: datetime, \n", + " *, \n", + " query_id: str = \"\",\n", + " correlation_request_id: str = \"\",\n", + "):\n", + " \"\"\"\n", + " Asynchronously execute the given query, restricted to the given time range, and parse the API response.\n", + "\n", + " :param str query: The query to execute.\n", + " :param datetime start_time: The start of the time range to query.\n", + " :param datetime end_time: The end of the time range to query.\n", + " :keyword str query_id: Optional identifier for the query, used for printing.\n", + " :keyword str correlation_request_id, Optional correlation ID to use in the query headers for tracing.\n", + " \"\"\"\n", + " headers = {}\n", + " if correlation_request_id:\n", + " headers[\"x-ms-correlation-request-id\"] = correlation_request_id\n", + "\n", + " try:\n", + " response = await client.query_workspace(\n", + " workspace_id=LOGS_WORKSPACE_ID,\n", + " query=query,\n", + " timespan=(start_time, end_time),\n", + " server_timeout=360, \n", + " include_statistics=False, # Can be used for debugging.\n", + " headers=headers,\n", + " retry_on_methods=[\"POST\"]\n", + " )\n", + " except HttpResponseError as e:\n", + " print(f\"Error when attempting query {query_id} (query time span: {start_time} to {end_time}):\\n\\t\", e)\n", + " return []\n", + "\n", + " if response.status == LogsQueryStatus.SUCCESS:\n", + " print(f\"Query {query_id} successful (query time span: {start_time} to {end_time}). Row count: {len(response.tables[0].rows)}\")\n", + " return response.tables[0]\n", + " else:\n", + " # This will be a LogsQueryPartialResult.\n", + " error = response.partial_error\n", + " print(f\"Partial results returned for query {query_id} (query time span: {start_time} to {end_time}):\\n\\t\", error)\n", + " return response.partial_data[0]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Query data\n", + "\n", + "With the helper functions defined, you can now query the data in chunks that won't hit the row count and data size service limits.\n", + "\n", + "### Set variables\n", + "\n", + "Before running the queries, some variables will need to be configured.\n", + "\n", + "- `QUERY` - KQL query to run. Change the table name and specify any required columns and filters as needed. When constructing this query, the recommendation is to use [reduced KQL](https://learn.microsoft.com/azure/azure-monitor/logs/basic-logs-query?tabs=portal-1#kql-language-limits) which are optimized for data retrieval. To get all rows/columns, just set `QUERY = `. \n", + "- `END_TIME` - End of the time range to query over.\n", + "- `DAYS_BACK` - The number of days to go back from the end time. For example, if `END_TIME = datetime.now()` and `DAYS_BACK = 7`, the query will return data from the last 7 days. Note that fetched data will (initially) be stored in memory on your system, so it is possible to run into memory limitations if the query returns a large amount of data. 
If this issue is encountered, consider querying the data in time segments. For example, instead of querying 365 days of data at once, query 100 days of data at a time by setting the values of `END_TIME` and `DAYS_BACK` appropriately and re-running the notebook from this cell onwards for each separate segment.\n", + "- `MAX_ROWS_PER_QUERY` - The max number of rows that is returned from a single query. This is defaulted to the service limit of 500,000 rows multiplied by some factor to allow for some wiggle room. This limit may sometimes be exceeded if many entries share the same timestamp.\n", + "- `MAX_BYTES_PER_QUERY` - The max size in bytes of data returned from a single query. This is defaulted to the service limit of 100 MiB multiplied by some factor to allow for some wiggle room.\n", + "- `MAX_CONCURRENT_QUERIES` - The max number of concurrent queries to run at once. This is defaulted to 5. Reducing this can help avoid errors due to rate limits." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# EDIT THIS VALUE WITH YOUR QUERY.\n", + "# If necessary, add a KQL `project` operator or any filtering operators to limit the number of rows returned.\n", + "QUERY = \"AppRequests\" \n", + "\n", + "# Use the current time in the system's local timezone as the end time.\n", + "END_TIME = datetime.now().astimezone()\n", + "\n", + "# If you want to use a different end time, uncomment the following line and adjust as needed.\n", + "# END_TIME = datetime.strptime(\"2023-01-01 00:00:00 +0000\", \"%Y-%m-%d %H:%M:%S %z\")\n", + "\n", + "DAYS_BACK = 90\n", + "\n", + "MAX_ROWS_PER_QUERY_SERVICE_LIMIT = int(5e5) # 500K\n", + "MAX_ROWS_PER_QUERY = int(MAX_ROWS_PER_QUERY_SERVICE_LIMIT * 0.9)\n", + "\n", + "MAX_BYTES_PER_QUERY_SERVICE_LIMIT = 100 * 1024 * 1024 # 100 MiB of compressed data\n", + "MAX_BYTES_PER_QUERY = int(MAX_BYTES_PER_QUERY_SERVICE_LIMIT * 0.6)\n", + "\n", + "MAX_CONCURRENT_QUERIES = 5" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Estimate data and costs (optional)\n", + "\n", + "Before running the chunked queries, it might first be prudent to estimate the size of the data if planning on exporting the query results to another service. The below cell defines another helper function that can be used to estimate the size of the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "async def estimate_data_size(query: str, end_time: datetime, days_back: int):\n", + " query = f\"{query} | summarize n_rows = count(), estimate_data_size = sum(estimate_data_size(*))\"\n", + " start_time = end_time - timedelta(days=days_back)\n", + " response = await client.query_workspace(\n", + " workspace_id=LOGS_WORKSPACE_ID,\n", + " query=query,\n", + " timespan=(start_time, end_time),\n", + " )\n", + "\n", + " columns = response.tables[0].columns\n", + " rows = response.tables[0].rows\n", + " df = pd.DataFrame(data=rows, columns=columns)\n", + " return df" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, run the following cell to estimate the size of the data that will be returned by the queries. Note that this is just an estimate and the actual size may vary slightly. 
This information can be used in conjunction with the Azure storage [pricing calculator](https://azure.microsoft.com/pricing/calculator/?service=storage) to determine costs that will be incurred for your storage setup. If using Azure Data Lake Storage Gen2, full billing details can be found [here](https://azure.microsoft.com/pricing/details/storage/data-lake/)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_size_df = await estimate_data_size(QUERY, END_TIME, DAYS_BACK)\n", + "data_size_df[\"estimate_data_size_MB\"] = data_size_df[\"estimate_data_size\"] / (1000 **2)\n", + "data_size_df[\"estimate_data_size_GB\"] = data_size_df[\"estimate_data_size_MB\"] / 1000\n", + "data_size_df" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Fetch log data\n", + "\n", + "Use the helper functions to create an async wrapper function that will query the data in chunks using the variables defined above. This function will return the results as a single pandas DataFrame. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "import itertools\n", + "import uuid\n", + "\n", + "# Limit the number of concurrent queries.\n", + "semaphore = asyncio.Semaphore(MAX_CONCURRENT_QUERIES)\n", + "\n", + "async def fetch_logs(query: str, start_time: datetime, end_time: datetime, query_id: str, correlation_request_id: str):\n", + " async with semaphore:\n", + " return await execute_query(query, start_time, end_time, query_id=query_id, correlation_request_id=correlation_request_id)\n", + "\n", + "\n", + "async def run():\n", + " # Below, we combine the endpoints retrieved from both endpoint methods to ensure that the number of rows\n", + " # and the size of the data returned are both within the limits.\n", + " # Worst case performance is double the theoretical minimum number of queries necessary.\n", + " row_batch_endpoints = await get_batch_endpoints_by_row_count(QUERY, END_TIME, days_back=DAYS_BACK, max_rows_per_query=MAX_ROWS_PER_QUERY)\n", + " byte_batch_endpoints = await get_batch_endpoints_by_byte_size(QUERY, END_TIME, days_back=DAYS_BACK, max_bytes_per_query=MAX_BYTES_PER_QUERY)\n", + " batch_endpoints = sorted(set(byte_batch_endpoints + row_batch_endpoints), reverse=True)\n", + "\n", + " queries = []\n", + " end_time = batch_endpoints[0]\n", + " correlation_request_id = str(uuid.uuid4())\n", + "\n", + " print(f\"Querying {len(batch_endpoints) - 1} time ranges, from {batch_endpoints[-1]} to {end_time}\")\n", + " print(f\"Correlation request ID: {correlation_request_id}\")\n", + " \n", + " for i in range(1, len(batch_endpoints)):\n", + " start_time = batch_endpoints[i]\n", + " queries.append(fetch_logs(QUERY, start_time, end_time, query_id=str(i), correlation_request_id=correlation_request_id))\n", + " end_time = start_time - timedelta(microseconds=1) # Subtract 1 microsecond to avoid overlap between queries.\n", + "\n", + " responses = await asyncio.gather(*queries)\n", + "\n", + " rows = itertools.chain.from_iterable([table.rows for table in responses if table])\n", + " columns = responses[0].columns\n", + " return pd.DataFrame(data=rows, columns=columns)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, go ahead and run the following cell to fetch the data. 
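Because notebook cells support top-level `await`, the `run` coroutine can be awaited directly rather than wrapped in `asyncio.run`.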
Note that this may take some time depending on the size of the data and the number of queries that need to be run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = await run()\n", + "print(f\"Retrieved {len(df)} rows\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After the data has been fetched, you can now use the `df` DataFrame for further analysis." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.head(30)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Optional: Export data to Azure Data Lake Storage (ADLS)\n", + "\n", + "If desired, the data queried from your Log Analytics workspace can be exported to an [Azure Data Lake Storage (ADLS)](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-introduction) account. This can be useful for storing the data for longer periods of time or for using it in other applications. To do this, the `azure-storage-file-datalake` Python package will be needed which uses the [ADLS Gen2 REST API](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-directory-file-acl-python) under the hood.\n", + "\n", + "### Setup\n", + "\n", + "First, ensure you have the required package installed:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "!{sys.executable} -m pip install --upgrade azure-storage-file-datalake" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, input your storage [connection string](https://learn.microsoft.com/azure/storage/common/storage-account-keys-manage) below and instantiate the ADLS service client." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from azure.storage.filedatalake import DataLakeServiceClient\n", + "\n", + "AZURE_STORAGE_CONNECTION_STRING = \"\"\n", + "\n", + "try:\n", + " adls_service_client = DataLakeServiceClient.from_connection_string(AZURE_STORAGE_CONNECTION_STRING)\n", + "except Exception as e:\n", + " print(e)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Next, define a helper function that can be used to interact with the ADLS storage account(s) to which the queried data will be exported to." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def upload_df_to_adls_path(\n", + " df: pd.DataFrame, \n", + " adls_dirname: str,\n", + " adls_filename: str,\n", + " container_name: str,\n", + "):\n", + " \"\"\"\n", + " Upload a pandas DataFrame to the specified ADLS path as a single JSON file.\n", + " \"\"\"\n", + " json_data = df.to_json(orient=\"records\", lines=True, date_format=\"iso\")\n", + " file_system_client = adls_service_client.get_file_system_client(file_system=container_name)\n", + "\n", + " try:\n", + " file_system_client.create_directory(adls_dirname)\n", + " except Exception as e:\n", + " print(e)\n", + "\n", + " try:\n", + " directory_client = file_system_client.get_directory_client(adls_dirname)\n", + " file_client = directory_client.get_file_client(adls_filename)\n", + " file_client.upload_data(json_data, overwrite=True)\n", + " except Exception as e:\n", + " print(e)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Upload data\n", + "\n", + "Now, run the following cell to upload the data using the helper function defined above after configuring the variables below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# NAme of the storage container. This must already exist.\n", + "CONTAINER_NAME = \"\"\n", + "\n", + "# Name of the directory to write to. This will be created if it does not exist.\n", + "DIRECTORY_NAME = \"monitor-log-dump\"\n", + "\n", + "# Name of the file to write to (include the .json extension).\n", + "FILENAME = \"monitor-log-dump.json\"\n", + "\n", + "upload_df_to_adls_path(df, DIRECTORY_NAME, FILENAME, CONTAINER_NAME)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this notebook, you learned how to query data from a Log Analytics workspace in chunks to avoid hitting the service limits. You also learned how to export the data to an Azure Data Lake Storage (ADLS) account." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + }, + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "665f5865bb085838e35a9597206be80722fad7fd0d11e0dfbe620869aad35c71" + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}