URL not detected correctly, when it contains a '#' #556

Closed
klartext opened this issue Jan 22, 2021 · 10 comments

@klartext

Describe the bug
URL not detected correctly, when it contains a '#'
http://www.mywebsite.org/#/foobar
Url-detection stops at the '#'

To Reproduce

Insert a URL containing a '#', like the one mentioned above, in edit mode
and view the result in preview mode.

Expected behavior
The complete URL is detected, and detection does not stop at the '#' symbol.

Versions:
Rednotebook 2.20, Arch-Linux, package 2.20-2 (2020-11-12).

@laraconda
Contributor

I believe this could be fixed by changing HASHTAG_PATTERN in rednotebook/data.py.

@jendrikseipp
Owner

Probably, yes. Feel free to raise a PR :)

@laraconda
Contributor

Alright. I discovered that the patterns in data.py match hashtags, and those matches are used for the Tags section. The matching for text coloring, URL recognition and so on is done in rednotebook/files/t2t.lang.

A single source of regexes would help avoid confusion like this. The second file uses an XML format to define how the matches are treated. Is the regex syntax in that file compatible with Python's regex syntax? If so, the regexes could be defined in a single place and used by both data.py and t2t.lang.

@laraconda
Contributor

laraconda commented May 31, 2023

Another thing I noticed is that the hashtag pattern in data.py excludes hex colors (as in #AAA000). It also excludes what I presume are cpp directives such as "#include". I have two observations:

  1. Given that the pattern HASHTAG is compiled with the flag re.IGNORECASE, tags like #face10, #facade and such will be treated as hexcolors (which they are it seems). I don't know if this is the intended effect. Given how the original regex for hex numbers is written (r"[0-9A-F]{6}"), it feels like the original intent was to only match hex colors written in uppercase.
  2. If I'm correct that "#include" is intended to exclude cpp directives, then it is necessary to add more like #define, #endif and such. (I already coded this so I hope I'm right).
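
To make both observations concrete, here is a minimal sketch. The pattern below is a hypothetical reconstruction for illustration, not the actual HASHTAG pattern from data.py; only the hex fragment `[0-9A-F]{6}` is quoted from the discussion above, and the directive list is an assumed subset.

```python
import re

HEX_COLOR = r"[0-9A-F]{6}"                        # hex fragment quoted in the comment above
CPP_DIRECTIVES = ("include", "define", "endif")   # illustrative subset, an assumption


def build_hashtag_pattern(flags=0):
    """Build a hypothetical hashtag regex that rejects hex colors and directives."""
    excluded = "|".join((HEX_COLOR,) + CPP_DIRECTIVES)
    # The negative lookahead rejects '#' followed by exactly one excluded word.
    return re.compile(rf"#(?!(?:{excluded})\b)\w+", flags)


def is_hashtag(text, flags=0):
    """Return True if the whole text counts as a hashtag under the sketch pattern."""
    return build_hashtag_pattern(flags).fullmatch(text) is not None


# Case-sensitive: only uppercase hex colors are rejected.
print(is_hashtag("#AAA000"))                 # False
print(is_hashtag("#facade"))                 # True
# With re.IGNORECASE, lowercase words made of hex digits are rejected too.
print(is_hashtag("#facade", re.IGNORECASE))  # False
print(is_hashtag("#define"))                 # False
```

Under these assumptions, the flag alone decides whether #facade and #face10 remain usable as tags.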

@klartext
Author

klartext commented Jun 1, 2023

I wonder whether using a library for the URL check would make sense.
I looked into URL parsing and URL validation.

For parsing URLs, urllib (Python Standard-Lib) can be used. But it does not check url-validity.
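
A minimal sketch of that distinction, using only `urllib.parse.urlsplit` from the standard library. The helper name `describe` is made up for this example.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a string into URL components without judging whether it is a valid URL."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "fragment": parts.fragment,
    }


# The URL from this issue parses cleanly; everything after '#' is the fragment.
print(describe("http://www.mywebsite.org/#/foobar"))

# A string that is clearly not a URL is still "parsed" without any error:
# it simply ends up in the path component.
print(describe("definitely not a url"))
```

Parsing therefore tells you how a string would be split, not whether it is a usable URL.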

I then looked for URL-validation and found the validators lib.
I have not used it so far, but it was recommended in some articles I found, and the lib seems to be widely used and was updated last week. There are some open issues, but nothing that looks bad.
This one: https://github.com/python-validators/validators

Maybe that lib might be considered here.

@laraconda
Contributor

I think fixing the patterns in t2t.lang would be better. The code is already coupled to it, and using a new library would be a lot more work. Plus, it seems that GTK Python uses an XML file to identify patterns and then act on them, for example displaying them in bold, underlining them, or coloring URLs. So I don't think evaluating each piece of text with a Python function would work here.
Thanks for the suggestion anyway.

@laraconda
Contributor

laraconda commented Jun 1, 2023

On another note: it turns out that a '#' on its own doesn't cause problems in URL recognition; a '/' after the '#' does:

These examples are correctly identified as links:
http://blog.example.net/post123#comments
http://www.example.com/page.html#section1
https://www.shop.com/product#reviews
http://www.example.org/#contact

These are not:
http://www.mywebsite.org/#/foobar
http://www.mywebsite.org/#foo/bar
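
The behavior can be reproduced with a small sketch in Python's `re` module. Both patterns are hypothetical and written only for illustration; they are not the regexes from t2t.lang, and the real pattern may fail in a different way.

```python
import re

# Hypothetical patterns for illustration only; NOT the regexes from t2t.lang.
# OLD_STYLE allows only word characters and hyphens in the fragment after '#'.
OLD_STYLE = re.compile(r"https?://[^\s#]+(?:#[\w-]*)?")
# FIXED allows any non-whitespace character, including '/', in the fragment.
FIXED = re.compile(r"https?://[^\s#]+(?:#\S*)?")


def detected_url(pattern, text):
    """Return the part of text that the pattern recognizes as a URL, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


for url in (
    "http://www.example.org/#contact",
    "http://www.mywebsite.org/#/foobar",
    "http://www.mywebsite.org/#foo/bar",
):
    print(url, "->", detected_url(OLD_STYLE, url), "|", detected_url(FIXED, url))
```

With the restrictive fragment class, the match for the last two URLs ends at the first '/' after the '#', while the relaxed class covers the whole URL.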

@jendrikseipp
Owner

I agree. Adding an external library is always a big pain.

I don't see an easy way of avoiding duplicating the hashtag regex in data.py and t2t.lang, since we'd have to parse the XML with Python, which seems excessive. It's probably best to just add a comment in both places that changing one line implies changing the other.
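
For reference, a rough sketch of what a check between the two copies could look like if it were ever wanted, for example in a test. Everything here is hypothetical: the XML fragment, the `hashtag` context id, and the regex are invented stand-ins and are not taken from t2t.lang or data.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the Python-side regex (not the one in data.py).
PYTHON_HASHTAG_REGEX = r"#\w+"

# Hypothetical stand-in for a language definition file (not the real t2t.lang).
LANG_SNIPPET = """
<language id="example">
  <definitions>
    <context id="hashtag">
      <match>#\\w+</match>
    </context>
  </definitions>
</language>
"""


def read_match_regex(xml_text, context_id):
    """Return the <match> text of the context with the given id, or None."""
    root = ET.fromstring(xml_text)
    for context in root.iter("context"):
        if context.get("id") == context_id:
            return context.findtext("match")
    return None


# A test could fail loudly when the two copies drift apart.
assert read_match_regex(LANG_SNIPPET, "hashtag") == PYTHON_HASHTAG_REGEX
```

This compares the strings only; it says nothing about whether the two regex engines interpret them the same way, which is one more reason the comment-based approach may be the pragmatic choice.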

Regarding hex values and C++ preprocessor directives, I agree with you, good catch!

laraconda added a commit to laraconda/rednotebook that referenced this issue Jun 1, 2023
@laraconda
Contributor

I submitted the PR: #703

jendrikseipp pushed a commit that referenced this issue Jun 10, 2023
* Fixing url not recognized when hashtag symbol is followed by slash. Issue #556
* Adding more cpp directives to hashtag pattern in t2t. Adding comment regarding what each hashtag regex is used for in both files.
---------
Co-authored-by: Jendrik Seipp <jendrikseipp@gmail.com>
@jendrikseipp
Owner

Fixed in #703.

jendrikseipp pushed a commit to laraconda/rednotebook that referenced this issue May 5, 2024
…#703)

* Fixing url not recognized when hashtag symbol is followed by slash. Issue jendrikseipp#556
* Adding more cpp directives to hashtag pattern in t2t. Adding comment regarding what each hashtag regex is used for in both files.
---------
Co-authored-by: Jendrik Seipp <jendrikseipp@gmail.com>