Be more permissive in what we consider a valid email address to be #2998

Closed
pdurbin opened this issue Mar 4, 2016 · 10 comments

Comments

@pdurbin
Member

pdurbin commented Mar 4, 2016

While working on #2512 and testing with random email addresses from the https://randomuser.me API, I observed that certain email addresses were treated as invalid. Examples include:

Is https://randomuser.me slipping us bad email addresses or is Dataverse being too strict about what is considered a valid email address?

The first three addresses above pass http://sphinx.mythic-beasts.com/~pdw/cgi-bin/emailvalidate but رونیکا.محمدخان@example.com does not. Locally, if I upgrade Apache Commons Validator from 1.4.0 to 1.5.0, all the addresses above are considered valid. I'll push a test for this.

https://en.wikipedia.org/wiki/Email_address says "In addition to the above ASCII characters, international characters above U+007F, encoded as UTF-8, are permitted by RFC 6531, though mail systems may restrict which characters to use when assigning local parts."

Here's a screenshot of how a user can't be created because of an email address that doesn't pass validation:

[Screenshot: screen shot 2016-03-03 at 12 10 50 pm]

How strict do we want Dataverse to be with regard to email addresses?

@pdurbin
Member Author

pdurbin commented Mar 4, 2016

In 8471c02 I added some tests to exercise this issue.

@djbrooke
Contributor

djbrooke commented Apr 5, 2017

We should also allow emails ending in regional domains. See #3754 for more info.

@pdurbin pdurbin changed the title from "Strictness of email validation" to "Be more permissive in what we consider a valid email address to be" Jun 28, 2017
@pdurbin pdurbin added the "Type: Suggestion" and "User Role: Curator" labels and removed the "Type: Bug" label Jun 28, 2017
@oscardssmith oscardssmith self-assigned this Jul 13, 2017
@matthew-a-dunlap
Contributor

Updating commons-validator fixes the issue. Comments were added to explain the choice of 1.5 over 1.5.1 or 1.6: errors were encountered with the later versions, and they did not provide any changes we needed.

@kcondon kcondon self-assigned this Jul 18, 2017
@kcondon
Contributor

kcondon commented Jul 19, 2017

The signup filters now work but javax.mail throws errors when trying to send to those email addresses:

[2017-07-19T09:47:54.731-0400] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.MailServiceBean] [tid: _ThreadID=52 _ThreadName=jk-connector(3)] [timeMillis: 1500472074731] [levelValue: 900] [[
Failed to send mail to michélle.pereboom2@mailinator.com]]

[2017-07-19T09:47:54.733-0400] [glassfish 4.1] [INFO] [] [] [tid: _ThreadID=52 _ThreadName=Thread-8] [timeMillis: 1500472074733] [levelValue: 800] [[
javax.mail.internet.AddressException: Local address contains control or whitespace in string ``michélle.pereboom2@mailinator.com''
    at javax.mail.internet.InternetAddress.checkAddress(InternetAddress.java:1220)

Interestingly, the new domain names like .cologne seem to be OK though I could not verify end-to-end.

I have seen many posts online about this problem with accented characters in email addresses in Java, and I even set the JVM option -Dmail.mime.allowutf8=true, but that did not work.

There may be a solution or workaround but it needs investigation. I don't think we should allow this until this is fixed unless we don't care whether we actually send email to users.
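One way to avoid accepting an address that then fails at send time would be a pre-send check on the local part. The sketch below is only an illustration: it assumes the exception above is triggered by characters outside printable ASCII in the local part, which the log suggests but this thread does not confirm. The class and method names are hypothetical and are not part of Dataverse or javax.mail.

```java
// Hypothetical helper, not part of Dataverse or javax.mail.
final class LocalPartCheck {

    /**
     * Returns true if every character before the last '@' is printable
     * ASCII with no spaces or control characters.
     * Assumption: this is roughly the condition the mail library enforces.
     */
    static boolean hasAsciiLocalPart(String address) {
        int at = address.lastIndexOf('@');
        if (at <= 0) {
            return false; // no '@', or an empty local part
        }
        for (int i = 0; i < at; i++) {
            char c = address.charAt(i);
            if (c <= 0x20 || c >= 0x7F) {
                return false; // control, space, DEL, or non-ASCII
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // \u00e9 is the accented "e" from the failing address in the log.
        System.out.println(hasAsciiLocalPart("michelle.pereboom2@mailinator.com"));      // true
        System.out.println(hasAsciiLocalPart("mich\u00e9lle.pereboom2@mailinator.com")); // false
    }
}
```

A check like this could be used to warn the user at signup rather than to reject the address outright; which of the two is appropriate is a policy decision.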

@kcondon
Contributor

kcondon commented Jul 19, 2017

Another potential issue to discuss is how well international email addresses are currently supported, and therefore how reliable sending and receiving such mail is. See:
https://en.wikipedia.org/wiki/International_email
The most significant aspect of this is the allowance of email addresses (also known as email identities) in most of the world's writing systems, at both interface and transport levels.
it is possible that the presence of UTF-8 characters in email headers would decrease the stability and reliability of transporting such email. This is becoming less and less the case as of 2014 and IDN (internationalized domain name) with the UTF-8 characters is taking over.
https://en.wikipedia.org/wiki/Email_address
Internationalization examples
The example addresses below would not be handled by RFC 5322 based servers, but are permitted by RFC 6530. Servers compliant with this will be able to handle these:
Latin alphabet (with diacritics): Pelé@example.com
Greek alphabet: δοκιμή@παράδειγμα.δοκιμή
Traditional Chinese characters: 我買@屋企.香港
Japanese characters: 甲斐@黒川.日本
Cyrillic characters: чебурашка@ящик-с-апельсинами.рф
Hindi email address: संपर्क@डाटामेल.भारत
Internationalization support
Postfix mailer supports internationalized mail since 2015-02-08 with a stable release 3.0.0.
Google has support for sending emails to and from internationalized domains, but does not allow the registration of non-ASCII email addresses.
Microsoft added similar functionality in Outlook 2016.
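For the domain half of an address, the JDK can already produce an ASCII-compatible form. The sketch below uses `java.net.IDN.toASCII` to convert an internationalized domain to Punycode while leaving the local part untouched. It does nothing for non-ASCII local parts, which still depend on the mail servers involved. The class and method names are hypothetical.

```java
import java.net.IDN;

// Hypothetical helper for illustration only.
final class DomainToAscii {

    /**
     * Returns the address with its domain converted to the ASCII
     * (Punycode) form. The local part is copied unchanged.
     */
    static String withAsciiDomain(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("no '@' in address");
        }
        String local = address.substring(0, at);
        String domain = address.substring(at + 1);
        // IDN.toASCII throws IllegalArgumentException for domains it cannot convert.
        return local + "@" + IDN.toASCII(domain);
    }

    public static void main(String[] args) {
        // \u00fc is "u" with an umlaut.
        System.out.println(withAsciiDomain("user@b\u00fccher.example")); // user@xn--bcher-kva.example
        System.out.println(withAsciiDomain("user@example.com"));         // user@example.com
    }
}
```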

@djbrooke
Contributor

Thanks for the discussion about this post-standup. Allowing these additional email addresses is a big plus. I understand there is some risk of undelivered mail, but we'll see how widespread it is before investigating a whitelist/blacklist or some other solution.

@landreev
Contributor

Sorry, I misspoke during the meeting earlier - there are still dots in these "new" email addresses; we are talking about a few new top-level domains added recently, such as ".cologne". The complete address still looks like somebody@someaddress.cologne.
But my main point still stands: since we are only validating the top-level domains, and not the whole address, we are already allowing bad, undeliverable addresses. Anybody can already enter foobar@madeupdomain.edu; we'll let it slide, because .edu is a legit top-level domain. That means we are not really introducing anything fundamentally new by recognizing some new top-level domains or non-Latin characters. We are still going to try to check that they are not entering totally made-up junk, but it's still the responsibility of the user to supply a working address, unless/until we start validating them right away.
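To make the point concrete, here is a minimal sketch of a check that looks only at the top-level domain. It is not how commons-validator works internally; the tiny allow-list and all names are hypothetical. It accepts foobar@madeupdomain.edu just as readily as a real address, because nothing beyond the final label is inspected.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of validating only the top-level domain.
final class TldOnlyCheck {

    // Deliberately tiny, made-up allow-list; a real validator ships a far longer one.
    static final Set<String> KNOWN_TLDS =
            new HashSet<>(Arrays.asList("com", "edu", "org", "cologne"));

    /** Returns true if the text after the last dot is a known TLD. */
    static boolean hasKnownTld(String address) {
        int at = address.lastIndexOf('@');
        int dot = address.lastIndexOf('.');
        // Require a local part, at least one character between '@' and the
        // last dot, and at least one character after the last dot.
        if (at < 1 || dot < at + 2 || dot == address.length() - 1) {
            return false;
        }
        String tld = address.substring(dot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_TLDS.contains(tld);
    }

    public static void main(String[] args) {
        System.out.println(hasKnownTld("somebody@someaddress.cologne")); // true
        System.out.println(hasKnownTld("foobar@madeupdomain.edu"));      // true, though undeliverable
        System.out.println(hasKnownTld("foobar@madeupdomain.notatld"));  // false
    }
}
```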

@kcondon
Contributor

kcondon commented Jul 19, 2017

Except that someone who enters a junk address knows that and does not expect email, whereas someone entering a valid international address would expect to receive mail and when they don't will open a support ticket. Also, you literally could not create an account with international characters before this change was made. Just saying.

@landreev
Contributor

True. A better parallel is not somebody knowingly entering a bad address, but a user misspelling or making a typo in their address. Like if I enter loenid@hdmc.havrard.edu - I may be expecting to receive mail, but it's not going to work.
But, this is precisely why we have that "verify your email address" feature - to allow them to make sure it is working, right?

@djbrooke djbrooke assigned sekmiller and unassigned oscardssmith Jul 19, 2017
@sekmiller sekmiller removed their assignment Jul 20, 2017
@djbrooke djbrooke added this to the 4.8 - Large Data Upload Integration milestone Jul 20, 2017
@kcondon kcondon self-assigned this Jul 20, 2017
@kcondon kcondon closed this as completed Jul 20, 2017
@pdurbin
Member Author

pdurbin commented Jul 27, 2017

@landreev yes, that's why we have the "verify your email address" feature. If the link we try to send them never reaches them, they will hopefully realize they've supplied the wrong email address. More importantly, we don't want bad actors to be able to sign up with president@whitehouse.gov (some email address they don't actually control) and have the Dataverse installation spam the poor president. Right now there are still no consequences if you don't verify your email address, but in the future we'd like to make it so that if you don't verify your email address we don't email you (or the person whose email you signed up with!).
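As a rough illustration of that future behavior, the sketch below gates notifications on a confirmed token. It is not the Dataverse implementation: the class, its methods, and the in-memory storage are all hypothetical, and a real version would persist tokens and expire them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch; not the actual Dataverse verification code.
final class EmailVerificationSketch {

    private final Map<String, String> pendingTokens = new HashMap<>(); // token -> address
    private final Set<String> verifiedAddresses = new HashSet<>();

    /** Creates a token that would be mailed to the address as a link. */
    String startVerification(String address) {
        String token = UUID.randomUUID().toString();
        pendingTokens.put(token, address);
        return token;
    }

    /** Marks the address as verified if the token is known; a token works once. */
    boolean confirm(String token) {
        String address = pendingTokens.remove(token);
        if (address == null) {
            return false;
        }
        verifiedAddresses.add(address);
        return true;
    }

    /** Only verified addresses may receive further mail. */
    boolean maySendNotifications(String address) {
        return verifiedAddresses.contains(address);
    }

    public static void main(String[] args) {
        EmailVerificationSketch sketch = new EmailVerificationSketch();
        String token = sketch.startVerification("president@whitehouse.gov");
        System.out.println(sketch.maySendNotifications("president@whitehouse.gov")); // false
        sketch.confirm(token);
        System.out.println(sketch.maySendNotifications("president@whitehouse.gov")); // true
    }
}
```

With a gate like this, someone who signs up with an address they do not control never receives the token, so the real owner gets at most the single verification message.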

7 participants