feat: Improve UX of choosing new workflow crawl type (#2067)
Resolves #2066

### Changes
- Allows directly choosing new "Page List" or "Site Crawl" from the workflow list
- Reverts terminology introduced in #2032
SuaYoo committed Sep 9, 2024
1 parent b4e34d1 commit c01e3dd
Showing 9 changed files with 111 additions and 73 deletions.
10 changes: 5 additions & 5 deletions docs/user-guide/crawl-workflows.md
@@ -12,19 +12,19 @@ Create new crawl workflows from the **Crawling** page, or the _Create New ..._

### Choose what to crawl

The first step in creating a new crawl workflow is to choose what you'd like to crawl. This determines whether the crawl type will be **URL List** or **Seeded Crawl**. Crawl types can't be changed after the workflow is created—you'll need to create a new crawl workflow.
The first step in creating a new crawl workflow is to choose what you'd like to crawl. This determines whether the crawl type will be **Page List** or **Site Crawl**. Crawl types can't be changed after the workflow is created—you'll need to create a new crawl workflow.

#### Known URLs `URL List`{ .badge-blue }
#### Page List

Choose this option if you already know the URL of every page you'd like to crawl. The crawler will visit every URL specified in a list, and optionally every URL linked on those pages.

A URL list is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.
A Page List workflow is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.

#### Automated Discovery `Seeded Crawl`{ .badge-orange }
#### Site Crawl

Let the crawler automatically discover pages based on a domain or start page that you specify.

Seeded crawls are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.
Site Crawl workflows are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.

After deciding what type of crawl you'd like to run, you can begin to set up your workflow. A detailed breakdown of available settings can be found in the [workflow settings guide](workflow-setup.md).

4 changes: 2 additions & 2 deletions docs/user-guide/getting-started.md
@@ -15,8 +15,8 @@ To start crawling with hosted Browsertrix, you'll need a Browsertrix account. [S
Once you've logged in you should see your org [overview](overview.md). If you land somewhere else, navigate to **Overview**.

1. Tap the _Create New..._ shortcut and select **Crawl Workflow**.
2. Choose **Known URLs**. We'll get into the details of the options [later](./crawl-workflows.md), but this is a good starting point for a simple crawl.
3. Enter the URL of the webpage that you noted earlier in **Crawl URL(s)**.
2. Choose **Page List**. We'll get into the details of the options [later](./crawl-workflows.md), but this is a good starting point for a simple crawl.
3. Enter the URL of the webpage that you noted earlier in **Page URL(s)**.
4. Tap _Review & Save_.
5. Tap _Save Workflow_.
6. You should now see your new crawl workflow. Give the crawler a few moments to warm up, and then watch as it archives the webpage!
16 changes: 8 additions & 8 deletions docs/user-guide/workflow-setup.md
@@ -8,32 +8,32 @@ Crawl settings are shown in the crawl workflow detail **Settings** tab and in th

## Crawl Scope

Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose _Known URLs_ (crawl type of **URL List**) or _Automated Discovery_ (crawl type of **Seeded Crawl**) when creating a new workflow.
Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose _Page List_ or _Site Crawl_ when creating a new workflow.

??? example "Crawling with HTTP basic auth"

Both URL List and Seeded crawls support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`.
Both Page List and Site Crawls support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`.

**These credentials WILL BE WRITTEN into the archive.** We recommend exercising caution and only archiving with dedicated archival accounts, changing your password or deleting the account when finished.
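The credential-in-URL format described above can be illustrated with a short sketch. This is not Browsertrix code; it is a hypothetical helper using the standard WHATWG `URL` API to show where the username and password land in a seed URL, and why they end up recorded in the archive.

```typescript
// Hypothetical helper (not part of Browsertrix): embed HTTP Basic Auth
// credentials into a seed URL, as the docs describe. Because the
// credentials become part of the URL itself, they WILL be written into
// the archive -- hence the recommendation to use a dedicated account.
function seedWithBasicAuth(
  base: string,
  username: string,
  password: string,
): string {
  const url = new URL(base);
  // The URL API exposes the userinfo component directly.
  url.username = encodeURIComponent(username);
  url.password = encodeURIComponent(password);
  return url.toString();
}

// seedWithBasicAuth("https://example.com/", "archivist", "s3cret")
// → "https://archivist:s3cret@example.com/"
```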

### Crawl Type: URL List
### Crawl Type: Page List

#### Crawl URL(s)
#### Page URL(s)

A list of one or more URLs that the crawler should visit and capture.

#### Include Any Linked Page

When enabled, the crawler will visit all the links it finds within each page defined in the _Crawl URL(s)_ field.

??? example "Crawling tags & search queries with URL List crawls"
??? example "Crawling tags & search queries with Page List crawls"
This setting can be useful for crawling the content of specific tags or search queries. Specify the tag or search query URL(s) in the _Crawl URL(s)_ field, e.g: `https://example.com/search?q=tag`, and enable _Include Any Linked Page_ to crawl all the content present on that search query page.

#### Fail Crawl on Failed URL

When enabled, the crawler will fail the entire crawl if any of the provided URLs are invalid or unsuccessfully crawled. The resulting archived item will have a status of "Failed".

### Crawl Type: Seeded Crawl
### Crawl Type: Site Crawl

#### Crawl Start URL

@@ -84,7 +84,7 @@ This can be useful for discovering and capturing pages on a website that aren't

### Exclusions

The exclusions table will instruct the crawler to ignore links it finds on pages where all or part of the link matches an exclusion found in the table. The table is only available in URL List crawls when _Include Any Linked Page_ is enabled.
The exclusions table will instruct the crawler to ignore links it finds on pages where all or part of the link matches an exclusion found in the table. The table is only available in Page List crawls when _Include Any Linked Page_ is enabled.

This can be useful for avoiding crawler traps — sites that may automatically generate pages such as calendars or filter options — or other pages that should not be crawled according to their URL.
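The exclusion behavior described above can be sketched as follows. This is a simplified illustration, not the crawler's actual matching code: it treats each table entry as either a plain substring or a regular expression, which mirrors the "all or part of the link matches" rule.

```typescript
// Simplified sketch of the exclusions table described above.
// Assumption: each entry is either a literal substring ("text") or a
// regular expression ("regex"); the real crawler's rules may differ.
type Exclusion = { type: "text" | "regex"; value: string };

// A discovered link is skipped if any exclusion matches all or part of it.
function isExcluded(link: string, exclusions: Exclusion[]): boolean {
  return exclusions.some((ex) =>
    ex.type === "text"
      ? link.includes(ex.value)
      : new RegExp(ex.value).test(link),
  );
}

// Example: avoid a calendar crawler trap and paginated filter pages.
// isExcluded("https://example.com/calendar/2024",
//   [{ type: "text", value: "/calendar/" }]) → true
```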

@@ -228,7 +228,7 @@ Describe and organize your crawl workflow and the resulting archived items.

### Name

Allows a custom name to be set for the workflow. If no name is set, the workflow's name will be set to the _Crawl Start URL_. For URL List crawls, the workflow's name will be set to the first URL present in the _Crawl URL(s)_ field, with an added `(+x)` where `x` represents the total number of URLs in the list.
Allows a custom name to be set for the workflow. If no name is set, the workflow's name will be set to the _Crawl Start URL_. For Page List crawls, the workflow's name will be set to the first URL present in the _Crawl URL(s)_ field, with an added `(+x)` where `x` represents the total number of URLs in the list.
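The default-name rule above can be sketched in a few lines. This is a hypothetical illustration of the documented behavior, not Browsertrix's implementation; it follows the docs' wording that `x` is the total number of URLs in the list.

```typescript
// Sketch of the default-name rule for Page List workflows, per the docs:
// first URL in the list, plus "(+x)" where x is the total URL count.
// (Assumption: a single-URL list gets no suffix; the docs don't say.)
function defaultWorkflowName(urls: string[]): string {
  if (urls.length <= 1) return urls[0] ?? "";
  return `${urls[0]} (+${urls.length})`;
}

// defaultWorkflowName(["https://example.com/a", "https://example.com/b",
//   "https://example.com/c"]) → "https://example.com/a (+3)"
```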

### Description

20 changes: 11 additions & 9 deletions frontend/src/components/ui/config-details.ts
@@ -333,7 +333,7 @@ export class ConfigDetails extends LiteElement {

return html`
${this.renderSetting(
msg("Crawl URL(s)"),
msg("Page URL(s)"),
html`
<ul>
${this.seeds?.map(
@@ -375,14 +375,16 @@
primarySeedConfig?.include || seedsConfig.include || [];
return html`
${this.renderSetting(
msg("Primary Seed URL"),
html`<a
class="text-blue-600 hover:text-blue-500 hover:underline"
href="${primarySeedUrl!}"
target="_blank"
rel="noreferrer"
>${primarySeedUrl}</a
>`,
msg("Crawl Start URL"),
primarySeedUrl
? html`<a
class="text-blue-600 hover:text-blue-500 hover:underline"
href="${primarySeedUrl}"
target="_blank"
rel="noreferrer"
>${primarySeedUrl}</a
>`
: undefined,
true,
)}
${this.renderSetting(
56 changes: 27 additions & 29 deletions frontend/src/features/crawl-workflows/new-workflow-dialog.ts
@@ -43,14 +43,15 @@ export class NewWorkflowDialog extends TailwindElement {
src=${urlListSvg}
/>
<figcaption class="p-1">
<div
class="my-2 text-lg font-semibold leading-none transition-colors group-hover:text-primary-700"
>
${msg("Known URLs")}
<div class="leading-none my-2 font-semibold">
<div class="transition-colors group-hover:text-primary-700">
${msg("Page List")}:
</div>
<div class="text-lg">${msg("One or more URLs")}</div>
</div>
<p class="text-balance leading-normal text-neutral-700">
<p class="leading-normal text-neutral-700">
${msg(
"Choose this option to crawl a single page, or if you already know the URL of every page you'd like to crawl.",
"Choose this option if you know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.",
)}
</p>
</figcaption>
@@ -72,14 +72,15 @@
src=${seededCrawlSvg}
/>
<figcaption class="p-1">
<div
class="my-2 text-lg font-semibold leading-none transition-colors group-hover:text-primary-700"
>
${msg("Automated Discovery")}
<div class="leading-none my-2 font-semibold">
<div class="transition-colors group-hover:text-primary-700">
${msg("Site Crawl")}:
</div>
<div class="text-lg">${msg("Website or directory")}</div>
</div>
<p class="text-balance leading-normal text-neutral-700">
<p class="leading-normal text-neutral-700">
${msg(
"Let the crawler automatically discover pages based on a domain or start page that you specify.",
"Specify a domain name, start page URL, or path on a website and let the crawler automatically find pages within that scope.",
)}
</p>
</figcaption>
@@ -92,32 +92,28 @@
@sl-after-hide=${this.stopProp}
>
<p class="mb-3">
${msg(
html`Choose <strong>Known URLs</strong> (aka a "URL List" crawl
type) if:`,
)}
${msg(html`Choose <strong>Page List</strong> if:`)}
</p>
<ul class="mb-3 list-disc pl-5">
<li>${msg("You want to archive a single page on a website")}</li>
<li>
${msg("You're archiving just a few specific pages on a website")}
${msg("You have a list of URLs that you can copy-and-paste")}
</li>
<li>
${msg("You have a list of URLs that you can copy-and-paste")}
${msg(
"You want to include URLs with different domain names in the same crawl",
)}
</li>
</ul>
<p class="mb-3">
${msg(
html`A URL list is simpler to configure, since you don't need to
worry about configuring the workflow to exclude parts of the
website that you may not want to archive.`,
html`A Page List workflow is simpler to configure, since you don't
need to worry about configuring the workflow to exclude parts of
the website that you may not want to archive.`,
)}
</p>
<p class="mb-3">
${msg(
html`Choose <strong>Automated Discovery</strong> (aka a "Seeded
Crawl" crawl type) if:`,
)}
${msg(html`Choose <strong>Site Crawl</strong> if:`)}
</p>
<ul class="mb-3 list-disc pl-5">
<li>${msg("You want to archive an entire website")}</li>
@@ -136,10 +134,10 @@
</ul>
<p class="mb-3">
${msg(
html`Seeded crawls are great for advanced use cases where you
don't need to know every single URL that you want to archive. You
can configure reasonable crawl limits and page limits so that you
don't crawl more than you need to.`,
html`Site Crawl workflows are great for advanced use cases where
you don't need to know every single URL that you want to archive.
You can configure reasonable crawl limits and page limits so that
you don't crawl more than you need to.`,
)}
</p>
<p>
4 changes: 2 additions & 2 deletions frontend/src/features/crawl-workflows/workflow-editor.ts
@@ -713,7 +713,7 @@ export class WorkflowEditor extends BtrixElement {
<sl-textarea
name="urlList"
class="textarea-wrap"
label=${msg("Crawl URL(s)")}
label=${msg("Page URL(s)")}
rows="10"
autocomplete="off"
inputmode="url"
@@ -1105,7 +1105,7 @@ https://example.net`}
${inputCol(html`
<sl-textarea
name="urlList"
label=${msg("Crawl URL(s)")}
label=${msg("Page URL(s)")}
rows="3"
autocomplete="off"
inputmode="url"
4 changes: 4 additions & 0 deletions frontend/src/pages/org/index.ts
@@ -536,6 +536,10 @@ export class Org extends LiteElement {

return html`<btrix-workflows-list
@select-new-dialog=${this.onSelectNewDialog}
@select-job-type=${(e: SelectJobTypeEvent) => {
this.openDialogName = undefined;
this.navTo(`${this.orgBasePath}/workflows?new&jobType=${e.detail}`);
}}
></btrix-workflows-list>`;
};

62 changes: 48 additions & 14 deletions frontend/src/pages/org/workflows-list.ts
@@ -1,5 +1,5 @@
import { localized, msg, str } from "@lit/localize";
import type { SlCheckbox } from "@shoelace-style/shoelace";
import type { SlCheckbox, SlSelectEvent } from "@shoelace-style/shoelace";
import { type PropertyValues } from "lit";
import { customElement, state } from "lit/decorators.js";
import { ifDefined } from "lit/directives/if-defined.js";
@@ -13,6 +13,7 @@ import type { SelectNewDialogEvent } from ".";
import { CopyButton } from "@/components/ui/copy-button";
import type { PageChangeEvent } from "@/components/ui/pagination";
import { type SelectEvent } from "@/components/ui/search-combobox";
import type { SelectJobTypeEvent } from "@/features/crawl-workflows/new-workflow-dialog";
import { pageHeader } from "@/layouts/pageHeader";
import type { APIPaginatedList, APIPaginationQuery } from "@/types/api";
import { isApiError } from "@/utils/api";
@@ -208,21 +209,54 @@ export class WorkflowsList extends LiteElement {
${when(
this.appState.isCrawler,
() => html`
<sl-button
variant="primary"
size="small"
?disabled=${this.org?.readOnly}
@click=${() => {
this.dispatchEvent(
new CustomEvent("select-new-dialog", {
detail: "workflow",
}) as SelectNewDialogEvent,
);
<sl-dropdown
distance="4"
placement="bottom-end"
@sl-select=${(e: SlSelectEvent) => {
const { value } = e.detail.item;
if (value) {
this.dispatchEvent(
new CustomEvent<SelectJobTypeEvent["detail"]>(
"select-job-type",
{
detail: value as SelectJobTypeEvent["detail"],
},
),
);
} else {
this.dispatchEvent(
new CustomEvent("select-new-dialog", {
detail: "workflow",
}) as SelectNewDialogEvent,
);
}
}}
>
<sl-icon slot="prefix" name="plus-lg"></sl-icon>
${msg("New Workflow")}
</sl-button>
<sl-button
slot="trigger"
size="small"
variant="primary"
caret
?disabled=${this.org?.readOnly}
>
<sl-icon slot="prefix" name="plus-lg"></sl-icon>
${msg("New Workflow...")}
</sl-button>
<sl-menu>
<sl-menu-item value="url-list">
${msg("Page List")}
</sl-menu-item>
<sl-menu-item value="seed-crawl">
${msg("Site Crawl")}
</sl-menu-item>
<sl-divider> </sl-divider>
<sl-menu-item>
<sl-icon slot="prefix" name="question-circle"></sl-icon>
${msg("Help me decide")}
</sl-menu-item>
</sl-menu>
</sl-dropdown>
`,
)}
`,
8 changes: 4 additions & 4 deletions frontend/src/pages/org/workflows-new.ts
@@ -1,4 +1,4 @@
import { localized, msg } from "@lit/localize";
import { localized, msg, str } from "@lit/localize";
import { mergeDeep } from "immutable";
import type { LitElement } from "lit";
import { customElement, property } from "lit/decorators.js";
@@ -59,8 +59,8 @@ export class WorkflowsNew extends LiteElement {
initialWorkflow?: WorkflowParams;

private readonly jobTypeLabels: Record<JobType, string> = {
"url-list": msg("URL List"),
"seed-crawl": msg("Seeded Crawl"),
"url-list": msg("Page List"),
"seed-crawl": msg("Site Crawl"),
custom: msg("Custom"),
};

@@ -98,7 +98,7 @@
return html`
<div class="mb-5">${this.renderBreadcrumbs()}</div>
<h2 class="mb-6 text-xl font-semibold">
${msg("New")} ${this.jobTypeLabels[jobType]}
${msg(str`New ${this.jobTypeLabels[jobType]} Workflow`)}
</h2>
${when(this.org, (org) => {
const initialWorkflow = mergeDeep(
