Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names #2202

Closed
mheppler opened this issue May 26, 2015 · 24 comments

Comments

@mheppler
Contributor

mheppler commented May 26, 2015

As referenced in #2192, there are files in production that need friendly MIME Type labels.

From @landreev

The file in question is ./src/main/java/MimeTypeDisplay.properties

We should identify as many of these as possible, and give them friendlier display names than the one that @pdurbin found.

documentation_and_metadata_-training_materials_dataverse-_2015-05-24_10 00 24

@mheppler mheppler added UX & UI: Design This issue needs input on the design of the UI and from the product owner Feature: File Upload & Handling labels May 26, 2015
@landreev
Contributor

landreev commented Jun 1, 2015

Also, I believe we should extend this "friendly name" functionality to support wildcards.
As in:

image/jpeg=JPEG Image
image/gif=GIF Image
image/bmp=Windows Bitmap Image
image/*=Graphic Image

i.e., we provide friendly names for the types we know about; and a generic name for an image of type image/blah-blah that's not specifically listed.
We can do the same with MS documents and other types of files. Because we'll always be encountering file types we don't know about.
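
The lookup order described above (exact match first, then a `type/*` fallback) could be sketched roughly as follows. This is an illustrative Python sketch, not the Dataverse implementation; the `friendly_name` function, the `DISPLAY_NAMES` dictionary, and the `"Unknown"` default are hypothetical names introduced only for this example.

```python
def friendly_name(mime_type, names, default="Unknown"):
    """Return a display name: exact match first, then a type/* wildcard.

    `names` maps MIME types (or wildcard keys such as "image/*") to labels.
    `default` is a hypothetical fallback label, not taken from the issue.
    """
    # Drop any parameters (e.g. "; charset=utf-8") and normalize case.
    key = mime_type.split(";", 1)[0].strip().lower()
    if key in names:
        return names[key]
    # Fall back to the generic entry for the major type, if one is listed.
    major = key.split("/", 1)[0]
    return names.get(major + "/*", default)


# The entries from the comment above, as a plain dictionary.
DISPLAY_NAMES = {
    "image/jpeg": "JPEG Image",
    "image/gif": "GIF Image",
    "image/bmp": "Windows Bitmap Image",
    "image/*": "Graphic Image",
}

print(friendly_name("image/gif", DISPLAY_NAMES))
print(friendly_name("image/blah-blah", DISPLAY_NAMES))
```

Keeping the wildcard as an ordinary key means the properties file format would not need to change; only the lookup gains a second step.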

@scolapasta scolapasta added this to the Candidates for 4.0.3 milestone Jun 1, 2015
@scolapasta scolapasta modified the milestones: 4.2, Candidates for 4.2 Jul 15, 2015
@mheppler
Contributor Author

Currently, the File Type values that are delivered from MimeTypeFacets.properties are lower case (see attached). I suggest that we capitalize them.

screen shot 2015-08-18 at 12 01 31 pm

@pdurbin
Member

pdurbin commented Sep 11, 2015

@scolapasta I'm passing this to you for a decision of what to do for 4.2.

@scolapasta scolapasta modified the milestones: Candidates for 4.3, 4.2 Sep 17, 2015
@mercecrosas mercecrosas modified the milestones: Candidates for 4.3, In Review Nov 30, 2015
@scolapasta scolapasta removed their assignment Jan 27, 2016
@scolapasta scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016
@mheppler
Contributor Author

mheppler commented Sep 7, 2016

Related to #3288 #3333 #3334 #3335

@pdurbin pdurbin added the User Role: Guest Anyone using the system, even without an account label Jul 4, 2017
@mheppler mheppler changed the title Dataset - Friendly File MIME Type Display Names Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names Apr 1, 2019
landreev added a commit that referenced this issue Jun 4, 2019
… search facets and default thumbnail icons.

(ref #2202)
@scolapasta scolapasta assigned pdurbin and landreev and unassigned landreev Jun 5, 2019
@landreev landreev removed their assignment Jun 5, 2019
@pdurbin
Member

pdurbin commented Jun 5, 2019

At standup I said I wanted to check if I had documented the new file type redetect API endpoint I added (phew, done already) and I see that @landreev just pushed a release note in ef40804 which looks good. I just moved this to QA. Also looked at the recent code-related commits that @landreev made since I last touched the branch and they all look good to me too.

@landreev
Contributor

landreev commented Jun 7, 2019

Something I should've done earlier - notes on how to test/what to look for:
There is more than one area where things were improved:

  1. The new API for re-identifying the types of files currently stored as unknown (mime type: "application/octet-stream") in the database. The API is /api/files/<FILEID>/redetect. Until this API is actually run in prod., the number that appears as "Unknown" in the type facets will not change. This API cannot be tested on the vm5 copy of the database, since it needs to read the actual files and we don't want to point vm5 to the prod. s3 bucket. But it can be tested on some select files.
  2. Better rules for classifying known mime types for the type facets indexing. This part can be tested on vm5 - a full reindex should affect the facet numbers, most notably:
    the misleading "Application" (30K files in prod. currently) facet should disappear completely;
    "Zip" facet (8K in prod.) should go away, replaced by "Archive", showing a higher number (all compressed and archived formats will be indexed under this type);
    A new facet "Code" should appear, with a sizeable number of files (20K+)
    A new facet "Other", with a relatively small number of files. (this is for the files previously indexed under the "Application" facet, that haven't been reclassified under more informative groupings).
  3. More file types should have "friendly" type descriptions (as appear on the dataset and dataverse pages). See the diff on the MimeTypeDisplay.properties file.
  4. Jhove should do a better job identifying some file types. The recommended way of testing this is by uploading files via the API, to take the browser and the OS out of the picture. File types to try: png, gzipped. Changing/stripping the .png and .gz filename extensions would ensure that the type is identified by the contents, and not by the extension.
  5. The list of recognized filename extensions used to guess the content type has been extended. See the diff on MimeTypeDetectionByFileExtension.properties.
  6. It may be worth confirming that Mike's type-specific default thumbnails are still working properly - the code that selects those has been reorganized as part of this PR too.
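
To illustrate the distinction drawn in items 4 and 5 between identifying a file by its contents and guessing from its filename extension, here is a minimal Python sketch. It does not use Jhove and is not the Dataverse code: `detect_type` and `EXTENSION_TYPES` are hypothetical, and the extension map is a tiny stand-in for MimeTypeDetectionByFileExtension.properties. The PNG and gzip signatures are the standard leading bytes of those formats.

```python
import os

# Standard leading bytes ("magic numbers") of the two formats named in item 4.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
GZIP_SIGNATURE = b"\x1f\x8b"

# Hypothetical stand-in for an extension-to-type properties file.
EXTENSION_TYPES = {
    ".png": "image/png",
    ".gz": "application/gzip",
    ".csv": "text/csv",
}


def detect_type(filename, head):
    """Guess a MIME type from the first bytes of a file, then from its name.

    `head` is the first few bytes of the file contents. Content checks run
    first, so a PNG or gzip file is still recognized after its extension
    has been changed or stripped.
    """
    if head.startswith(PNG_SIGNATURE):
        return "image/png"
    if head.startswith(GZIP_SIGNATURE):
        return "application/gzip"
    # Fall back to the filename extension, then to the "unknown" type.
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TYPES.get(ext, "application/octet-stream")


# A gzipped file with its .gz extension stripped is still identified.
print(detect_type("archive", b"\x1f\x8b\x08\x00"))
```

Files that match neither a signature nor a known extension fall through to "application/octet-stream", which is the type that shows up as "Unknown" in the facets.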

@djbrooke djbrooke assigned kcondon and unassigned landreev Jun 10, 2019
kcondon added a commit that referenced this issue Jun 11, 2019
 Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names #2202
@mheppler
Contributor Author

Peeked at the icons mentioned in "6" and suggested tweaks for data and archive icons. Put my random selection of 84 unknown files into dvn-build and Data went from last (2 of 84) to first (18 of 84). The unknowns were still pretty high, but hopefully we see greater gains in the full 127,109 pool of unknowns in production, since all 84 of those files were unknowns there originally.

Screen Shot 2019-06-11 at 12 25 51 PM


8 participants