Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: cache IANA TLDs for faster lookups #390

Merged

Conversation

salty-horse
Copy link
Contributor

Follow-up to the discussion in #362.

I wasn't sure what was meant by using dataclass, as this isn't a data-first class. I used a regular class, instead.

One thing that's obviously missing is tests. I'm not familiar with pytest, and don't know how to re-run the existing domain tests after running the new "load" function.

Here are some basic timing results:

$ python -m timeit -s 'import validators' 'validators.domain("example.com", consider_tld=True)'
5000 loops, best of 5: 63.5 usec per loop
$ python -m timeit -s 'import validators; validators.load_iana_tlds_to_memory()' 'validators.domain("example.com", consider_tld=True)'
50000 loops, best of 5: 4.63 usec per loop

$ python -m timeit -s 'import validators' 'validators.domain("example.blabla", consider_tld=True)'
1000 loops, best of 5: 200 usec per loop
$ python -m timeit -s 'import validators; validators.load_iana_tlds_to_memory()' 'validators.domain("example.blabla", consider_tld=True)'
20000 loops, best of 5: 16.5 usec per loop

@yozachar
Copy link
Collaborator

yozachar commented Jul 17, 2024

Hi, validators already uses an environment variable to decide whether or not to throw errors, so I thought (again that) we could do something like this instead:

class _TLDList:
    """Cache IANA TLDs."""
    cache = set[str]()


def _load_tld_to_memory(tld_file_path: Path):
    """Load IANA TLDs to memory."""
    if not _TLDList.cache:
        with tld_file_path.open() as tld_f:
            _ = next(tld_f)  # ignore the first line
            _TLDList.cache = set(line.strip() for line in tld_f)
    return _TLDList.cache


def _iana_tld():
    """Provide IANA TLDs."""
    # # source: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
    tld_file_path = Path(__file__).parent.joinpath("_tld.txt")

    if environ.get("LOAD_TLD_TO_MEMORY", "False") == "True":
        return _load_tld_to_memory(tld_file_path)

    with tld_file_path.open() as tld_f:
        _ = next(tld_f)  # ignore the first line
        for line in tld_f:
            yield line.strip()

Environment LOAD_TLD_TO_MEMORY is the pivot here.

@salty-horse
Copy link
Contributor Author

Some questions:

  1. Where should the env var be documented? In the API reference for the domain validator, or in "Install and Use"?
  2. The names of both RAISE_VALIDATION_ERROR and LOAD_TLD_TO_MEMORY are awfully vague, and a bad practice since they are used in a global setting with a lot of other unrelated environment variables. Shouldn't they include a prefix such as PYTHON_VALIDATORS_? How do you expect them to be used in a way that makes it obvious what they affect? Setting os.environ['LOAD_TLD_TO_MEMORY'] = 'True' early in the project that uses it?
  3. Is the code above how you want to structure it? With private global functions instead of methods inside the private class. I think it's cleaner to open the file in one place, but I can just copy your code if that's what you want.
  4. Also about your code, I noticed that the lowest supported Python version is 3.8, which doesn't support type hinting like set[str].
  5. Still unclear how to test this with pytest by reusing the existing tests.

@yozachar
Copy link
Collaborator

  1. I was planning to put it on the Install and Use page, but I think, another page can be added say Environment Variables. It is probably better to update the documentation in a followup PR.

  2. ... are awfully vague, and a bad practice since they are used in a global setting with a lot of other unrelated environment variables.

    Very true! You can for now set the environment variable as PYVLD_LOAD_TLD_TO_MEMORY. (Is PYVLD_ a good prefix? Short ones are preferable) Again, another PR can bring more consistency.

    How do you expect them to be used in a way that makes it obvious what they affect? Setting os.environ['LOAD_TLD_TO_MEMORY'] = 'True' early in the project that uses it?

    As the documentation shows, it must be set via the shell, before running your program.

  3. You have the creative freedom, @classmethods are totally fine. You no longer need to expose any functions to load TLDs.

  4. Ah, good catch! My bad.

  5. Currently, all the tests pass 'cause TLDs aren't being considered while testing. I'll take care of writing tests for the other environment variable too.

@salty-horse
Copy link
Contributor Author

BTW, I wonder if it's worth it to hardcode 'com', 'org', 'net' outside of the file, to catch the most common TLD's quickly.

@salty-horse
Copy link
Contributor Author

Pushed a commit with changes, including my hard-coded list to cover the common TLDs before trying the file.

For the documentation, I think it should be covered/linked the domain, hostname, and URL pages, regardless of any other place you think is important. Someone who's just reading about consider_tlds needs to understand what it's doing with/without the env file and decide whether they need it.

Copy link
Collaborator

@yozachar yozachar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel, free to counter suggest/correct/modify/improve.

src/validators/domain.py Outdated Show resolved Hide resolved
src/validators/domain.py Outdated Show resolved Hide resolved
src/validators/domain.py Outdated Show resolved Hide resolved
src/validators/domain.py Outdated Show resolved Hide resolved
src/validators/domain.py Show resolved Hide resolved
src/validators/domain.py Outdated Show resolved Hide resolved
@salty-horse
Copy link
Contributor Author

Should I squash all the commits, or do you want to merge it yourself?

@yozachar
Copy link
Collaborator

I'll squash and merge.

@yozachar yozachar changed the title feat: add load_iana_tlds_to_memory for faster lookups with consider_tld=True feat: cache IANA TLDs for faster lookups Jul 19, 2024
@yozachar yozachar merged commit 2dc2a9e into python-validators:master Jul 19, 2024
17 checks passed
@yozachar
Copy link
Collaborator

Thanks for the PR!

@salty-horse
Copy link
Contributor Author

Thank you for the help and accepting the feature!

@davidt99 davidt99 mentioned this pull request Sep 2, 2024
@yozachar yozachar added the enhancement Issue/PR: A new feature label Sep 13, 2024
@yozachar yozachar self-assigned this Sep 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Issue/PR: A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants