Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names #2202

mheppler · 2015-05-26T21:44:04Z

As referenced in #2192, there are files in production that need friendly MIME Type labels.

The file in question is ./src/main/java/MimeTypeDisplay.properties

We should identify as many of these as possible, and give them friendlier display names than the one that @pdurbin found.

landreev · 2015-06-01T17:56:53Z

Also, I believe we should extend this "friendly name" functionality to support wildcards, as in:

image/jpeg=JPEG Image
image/gif=GIF Image
image/bmp=Windows Bitmap Image
image/*=Graphic Image

I.e., we provide friendly names for the types we know about, and a generic name for an image of type image/blah-blah that's not specifically listed.
We can do the same with MS documents and other types of files, because we'll always be encountering file types we don't know about.
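A minimal sketch of the wildcard lookup proposed above, assuming the entries are loaded into a plain dictionary. The function name `friendly_name`, the `DISPLAY_NAMES` table, and the "Unknown" default are illustrative assumptions, not the project's actual code:

```python
# Hypothetical display-name table, using the entries from the comment above.
DISPLAY_NAMES = {
    "image/jpeg": "JPEG Image",
    "image/gif": "GIF Image",
    "image/bmp": "Windows Bitmap Image",
    "image/*": "Graphic Image",
}


def friendly_name(mime_type, names=DISPLAY_NAMES, default="Unknown"):
    """Return a display name: exact match first, then the type/* wildcard."""
    # Normalize: drop parameters such as "; charset=utf-8" and ignore case.
    key = mime_type.split(";", 1)[0].strip().lower()
    if key in names:
        return names[key]
    # Fall back to the wildcard entry for the top-level type, e.g. "image/*".
    top_level = key.split("/", 1)[0]
    return names.get(top_level + "/*", default)
```

With this table, `friendly_name("image/gif")` resolves through the exact entry, while `friendly_name("image/blah-blah")` falls through to the `image/*` entry. Checking the exact key first keeps specific names from being shadowed by the generic one.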

mheppler · 2015-08-18T16:02:47Z

Currently, the File Type values that are delivered from MimeTypeFacets.properties are lower case (see attached). I suggest that we capitalize them.

pdurbin · 2015-09-11T20:15:46Z

@scolapasta I'm passing this to you for a decision on what to do for 4.2.

mheppler · 2016-09-07T14:46:44Z

Related to #3288 #3333 #3334 #3335

pdurbin · 2019-06-05T16:04:32Z

At standup I said I wanted to check if I had documented the new file type redetect API endpoint I added (phew, done already) and I see that @landreev just pushed a release note in ef40804 which looks good. I just moved this to QA. I also looked at the recent code-related commits that @landreev made since I last touched the branch and they all look good to me too.

landreev · 2019-06-07T20:49:55Z

Something I should've done earlier: notes on how to test and what to look for.
There is more than one area where things were improved:

1. The new API for re-identifying the types of files currently stored as unknown (MIME type "application/octet-stream") in the database. The API is /api/files/<FILEID>/redetect. Until this API is actually run in prod., the number that appears as "Unknown" in the type facets will not change. This API cannot be tested on the vm5 copy of the database, since it needs to read the actual files and we don't want to point vm5 to the prod. S3 bucket. But it can be tested on some select files.
2. Better rules for classifying known MIME types for the type facets indexing. This part can be tested on vm5; a full reindex should affect the facet numbers, most notably:
   - the misleading "Application" facet (30K files in prod. currently) should disappear completely;
   - the "Zip" facet (8K in prod.) should go away, replaced by "Archive", showing a higher number (all compressed and archived formats will be indexed under this type);
   - a new facet "Code" should appear, with a sizeable number of files (20K+);
   - a new facet "Other" should appear, with a relatively small number of files (this is for the files previously indexed under the "Application" facet that haven't been reclassified under more informative groupings).
3. More file types should have "friendly" type descriptions (as they appear on the dataset and dataverse pages). See the diff on the MimeTypeDisplay.properties file.
4. Jhove should do a better job identifying some file types. The recommended way of testing this is by uploading files via the API, to take the browser and the OS out of the picture. File types to try: png, gzipped. Changing or stripping the .png and .gz filename extensions would ensure that the type is identified by the contents, and not by the extension.
5. The list of recognized filename extensions used to guess the content type has been extended. See the diff on MimeTypeDetectionByFileExtension.properties.
6. It may be worth confirming that Mike's type-specific default thumbnails are still working properly; the code that selects those has been reorganized as part of this PR too.
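To illustrate what "identified by the contents, and not by the extension" means in the Jhove testing note above, here is a minimal sketch of signature-based detection for the two suggested test formats. It is not how Jhove or Dataverse implement detection; `sniff_type` is a hypothetical helper, and the returned MIME strings are assumptions that may differ from the values Dataverse actually stores:

```python
import gzip

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte signature at the start of every PNG file
GZIP_MAGIC = b"\x1f\x8b"              # first two bytes of every gzip stream


def sniff_type(header: bytes) -> str:
    """Guess a MIME type from the leading bytes of a file, ignoring its name."""
    if header.startswith(PNG_SIGNATURE):
        return "image/png"
    if header.startswith(GZIP_MAGIC):
        return "application/gzip"
    # Nothing recognized: fall back to the generic "unknown" type.
    return "application/octet-stream"


# A gzipped payload is recognized regardless of what the file is called.
compressed = gzip.compress(b"some test data")
detected = sniff_type(compressed[:8])
```

Because only the leading bytes are inspected, renaming `data.gz` to `data` or `picture.png` to `picture.bin` does not change the result, which is the behavior the stripped-extension test is meant to confirm.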

mheppler · 2019-06-11T16:30:43Z

Peeked at the icons mentioned in "6" and suggested tweaks for the data and archive icons. Put my random selection of 84 unknown files into dvn-build and Data went from last (2 of 84) to first (18 of 84). The unknowns were still pretty high, but hopefully we will see greater gains in the full pool of 127,109 unknowns in production, since all 84 of those files were unknowns there originally.

mheppler added the UX & UI: Design and Feature: File Upload & Handling labels May 26, 2015

mheppler mentioned this issue May 26, 2015

Citation: Remove MD5s, if you have UNF #2192

Closed

scolapasta added this to the Candidates for 4.0.3 milestone Jun 1, 2015

scolapasta modified the milestones: 4.2, Candidates for 4.2 Jul 15, 2015

pdurbin assigned scolapasta Sep 11, 2015

scolapasta modified the milestones: Candidates for 4.3, 4.2 Sep 17, 2015

mercecrosas modified the milestones: Candidates for 4.3, In Review Nov 30, 2015

scolapasta removed their assignment Jan 27, 2016

scolapasta added Status: Triaged and removed Status: Dev labels Jan 28, 2016

scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016

raprasad mentioned this issue Aug 16, 2016

Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) #3288

Closed

2 tasks

mheppler mentioned this issue Sep 7, 2016

12,400+ of "unknown files" are ".xz" - Mark them appropriately #3333

Closed

2 tasks

This was referenced Sep 7, 2016

Fix 6% of unknown files; Classify 7,249 .NSDstat files #3334

Closed

Fix 4.1% of unknown files; Classify 5,156 medical imaging files ( .dcm) #3335

Closed

pdurbin added the Help Wanted: Code and Mentor: pdurbin labels and removed the Triaged label Jun 25, 2017

pdurbin added the User Role: Guest label Jul 4, 2017

mheppler changed the title ~~Dataset - Friendly File MIME Type Display Names~~ Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names Apr 1, 2019

mheppler removed the Help Wanted: Code label Apr 23, 2019

landreev added a commit that referenced this issue Jun 4, 2019

extra code in the redetect type command, to read non-local files (#2202)

edcfad3

landreev added a commit that referenced this issue Jun 4, 2019

Final reorganization of the code used to group files by type, for the search facets and default thumbnail icons. (ref #2202)

1752b2a

scolapasta assigned pdurbin and landreev and unassigned landreev Jun 5, 2019

landreev added a commit that referenced this issue Jun 5, 2019

release notes with the upgrade instructions for #2202.

ef40804

landreev removed their assignment Jun 5, 2019

pdurbin removed their assignment Jun 5, 2019

landreev added a commit that referenced this issue Jun 5, 2019

an extra null check, if the page needs to run the method on a null file (?) (#2202)

97155ab

landreev added a commit that referenced this issue Jun 7, 2019

fixed a type check to be case-insensitive (#2202)

2de1761

kcondon assigned landreev Jun 7, 2019

djbrooke assigned kcondon and unassigned landreev Jun 10, 2019

landreev added a commit that referenced this issue Jun 11, 2019

better choice of default icons for "data" and "archive", per review by Mike. (#2202)

7b7dbec

kcondon closed this as completed in f95a627 Jun 11, 2019

kcondon added a commit that referenced this issue Jun 11, 2019

Merge pull request #5853 from IQSS/2202-file-type-facet-fix

86bb329

djbrooke added this to the 4.15 milestone Jun 11, 2019

djbrooke mentioned this issue Jun 11, 2019

Update Filetypes IQSS/dataverse.harvard.edu#21

Closed

mheppler mentioned this issue Jun 13, 2019

Some excel files don't go through ingest process. #2264

Closed

mheppler mentioned this issue Jan 17, 2020

Dataverse should correctly categorize certain .spx files as geospatial files #6541

Closed

mheppler mentioned this issue Feb 14, 2020

Add extra information for mime type "text/comma-separated-values" #4943

Closed

mheppler mentioned this issue Jan 12, 2021

add more mime types based on use frequency #7502

Merged
