-
Notifications
You must be signed in to change notification settings - Fork 14.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sitemap specify default filter url (Fixes: CVE-2023-46229) #11925
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
@@ -20,8 +21,43 @@ def _batch_block(iterable: Iterable, size: int) -> Generator[List[dict], None, N | |||
yield item | |||
|
|||
|
|||
def _extract_domain_and_scheme(url: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consider renaming to the order in which the components appear in the string
def _extract_domain_and_scheme(url: str) -> str: | |
def _extract_scheme_and_domain(url: str) -> str: |
self.filter_urls: Optional[List[str]] = [ | ||
_extract_domain_and_scheme(web_path) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filter URLs are treated as a regex, so extracting http://example.com
from the web path will inappropriately match both http://examplexcom.org
and http://example.com.attacker.com/
A better approach would be to parse each target URL and check it against the scheme and domain, then match it against the regex with proper escaping.
self.filter_urls = filter_urls | ||
# Define a list of URL patterns (interpreted as regular expressions) that | ||
# will be allowed to be loaded. | ||
# restrict_to_same_domain takes precedence over filter_urls when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mention the precedence in the docstring?
if self.allow_url_patterns and not any( | ||
re.match(regexp_pattern, loc_text) | ||
for regexp_pattern in self.allow_url_patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we care about performance at all, but this will recompile all the regexes for every visited URL which is a bit of a heavyweight operation. The regexes don't change so we could compile them once and then use the compiled forms each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not important in this case, can always optimize later if needed
Co-authored-by: Predrag Gruevski <2348618+obi1kenobi@users.noreply.github.com>
Fixes: CVE-2023-46229 |
Specify default filter URL in sitemap loader and add a security note --------- Co-authored-by: Predrag Gruevski <2348618+obi1kenobi@users.noreply.github.com>
Specify default filter URL in sitemap loader and add a security note
Fixes: CVE-2023-46229