Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved crawler support for dynamically loaded pages #2710

Merged

Conversation

danielbichuetti
Copy link
Contributor

@danielbichuetti danielbichuetti commented Jun 22, 2022

Related Issue(s): #2658 #2679

Proposed changes:

  • To avoid errors when objects are removed from DOM after Crawler has captured the links, now we have error treatment
  • It has been included in constructor and class methods one optional argument: loading_wait_time. If this argument is not provided Crawler won't wait to begin scraping, and if any element is removed from DOM the error treatment will be triggered, the error will be logged and Crawler will continue. If provided (in seconds), Crawler will wait x seconds after each page load before it begins scraping. After a long investigation the preferred method was timer.sleep despite Selenium having an implicit wait, this choice was made due to instabilities on some browser and driver versions which turned it useless.
  • Test has been added for dynamic DOM scenario
  • Some methods that Selenium will be deprecating have been updated to new ones

Pre-flight checklist

@CLAassistant
Copy link

CLAassistant commented Jun 22, 2022

CLA assistant check
All committers have signed the CLA.

@ZanSara ZanSara self-requested a review June 23, 2022 07:47
@ZanSara ZanSara added type:bug Something isn't working type:feature New feature or request topic:crawler labels Jun 23, 2022
Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this contribution! 😊 This PR looks very good already. I added a few comments, but they're all marginal improvements, as I detected no issues with your code.

Let me know if you have the time to address these points, otherwise I'll help you with them so that we can merge this soon.

haystack/nodes/connector/crawler.py Outdated Show resolved Hide resolved
test/nodes/test_connector.py Outdated Show resolved Hide resolved
test/nodes/test_connector.py Outdated Show resolved Hide resolved
test/nodes/test_connector.py Show resolved Hide resolved
@danielbichuetti
Copy link
Contributor Author

Hi @ZanSara

Thanks for the kind answer! I have committed the changes. I decided to move the result to a text file since there is no reason to crawl again, as we are setting up an artificial result for the dynamic DOM.

Best regards!

@ZanSara
Copy link
Contributor

ZanSara commented Jun 24, 2022

Hi @danielbichuetti, great, this looks really good! I'll wait for the CI to complete and if there's no issue with the tests I'll approve and merge 👍 Thank you for the contribution! 😊

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you! I left two comments, about really small details. If you don't have time to address them no problem, I'll take care of them and merge afterwards 👍

test/nodes/test_connector.py Outdated Show resolved Hide resolved
test/nodes/test_connector.py Outdated Show resolved Hide resolved
@danielbichuetti
Copy link
Contributor Author

Any more changes required before merging this PR ?

Best regards

@ZanSara ZanSara merged commit e3b2ee9 into deepset-ai:master Jul 1, 2022
@ZanSara
Copy link
Contributor

ZanSara commented Jul 1, 2022

No, that's good! Thank you for the contribution! 🙂

@danielbichuetti danielbichuetti deleted the add_crawler_loading_wait_time branch July 4, 2022 14:32
Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022
* Improved crawler support for dynamically loaded pages

* Reduced scope of StaleElementReferenceException and removed deprecated code from WebDriver initialization

* Improvements on crawler testing code

* Code format and style applied on f028331

* Update Documentation & Code Style

* Remove unused imports/parameters

Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:crawler type:bug Something isn't working type:feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants