Checking for unused files which are not in the manifest #58

Closed
rdeltour opened this issue Oct 15, 2013 · 18 comments

Comments

@rdeltour
Member

From dau...@gmail.com on March 12, 2010 20:13:51

This is an extension of issue #38 . ePubCheck does not currently detect
files that are in the zip package but not the manifest, if these files are
not referenced from anywhere in the ePub.

A major new player in the eBook marketplace is now checking this, and
rejecting titles with "unmanifested" files. I know of cases where full
print PDFs accidentally ended up in the ePub zip. This doesn't really cause
any problems, but it does increase file size.

Dave Cramer
Hachette Book Group

Original issue: http://code.google.com/p/epubcheck/issues/detail?id=58
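
The check requested here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not EPUBCheck's implementation: `unmanifested_entries` and `build_sample_epub` are hypothetical names, the sample EPUB is made up, and only the first `<rootfile>` in container.xml is consulted. The sketch is deliberately naive and makes no allowance for META-INF, which later comments in this thread discuss.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"
OPF_NS = "{http://www.idpf.org/2007/opf}"


def unmanifested_entries(epub):
    """Return zip entries that the OPF manifest does not list (hypothetical helper)."""
    with zipfile.ZipFile(epub) as zf:
        # Directory entries end with "/"; only real files are of interest.
        names = {n for n in zf.namelist() if not n.endswith("/")}

        # container.xml points at the package document (the OPF file).
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(f".//{CONTAINER_NS}rootfile").get("full-path")

        # Manifest hrefs are resolved relative to the directory holding the OPF.
        opf = ET.fromstring(zf.read(opf_path))
        base = posixpath.dirname(opf_path)
        listed = {
            posixpath.normpath(posixpath.join(base, unquote(item.get("href"))))
            for item in opf.iter(f"{OPF_NS}item")
        }

    # Assumption: the mimetype file and the OPF itself are not manifest items.
    return sorted(names - listed - {"mimetype", opf_path})


def build_sample_epub():
    """Build a tiny in-memory EPUB with one stray PDF (made-up test data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr(
            "META-INF/container.xml",
            '<container version="1.0" '
            'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
            '<rootfiles><rootfile full-path="OEBPS/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles>'
            "</container>",
        )
        zf.writestr(
            "OEBPS/content.opf",
            '<package xmlns="http://www.idpf.org/2007/opf" version="2.0">'
            '<manifest><item id="ch1" href="ch1.xhtml" '
            'media-type="application/xhtml+xml"/></manifest></package>',
        )
        zf.writestr("OEBPS/ch1.xhtml", "<html/>")
        zf.writestr("OEBPS/print.pdf", "stray print PDF")
    buf.seek(0)
    return buf


print(unmanifested_entries(build_sample_epub()))
```

With the sample above, the stray `OEBPS/print.pdf` is reported, and so is `META-INF/container.xml`, which shows why a real check needs the exemptions raised later in the thread.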

@rdeltour
Member Author

From atuleshc...@gmail.com on April 28, 2010 02:15:47

Hi Dave,

Do you have a solution for this other than epubcheck, or do you want one?

Thanks
Atulesh

@rdeltour
Member Author

From garthcon...@gmail.com on August 03, 2010 12:08:37

Indeed Apple does have this requirement, but I don't believe it's really required by EPUB. So, at most I'd think this should be a warning.

Also, we have to be careful of files in META-INF -- certainly container.xml needs to get a pass, as would the rights and encryption files. Should all files in META-INF get a pass? Some folks are using that location for additional metadata, which may be wacky, but seems technically valid.

@rdeltour
Member Author

From dau...@gmail.com on August 03, 2010 13:20:49

I think a warning for unmanifested files not in META-INF is reasonable. It's unlikely to be intentional...

The other tricky case is .DS_Store files. I have fun with these--if Apple complains about them, I know who to blame!

Another common stray is thumbs.db files, but that's probably just due to poor workflow hygiene.

Dave

@rdeltour
Member Author

From dave...@gmail.com on August 03, 2010 15:19:44

I think the point that it's likely to be unintentional is exactly the reason to make it an error instead of a warning. We have seen many cases where .html.bak files are left around; one time we found a PDF of the entire book. One of the most important reasons to have and require a manifest is to document what is intentionally present in the container. Whether this requirement gets added to the spec or not, Apple will continue to reject any epub that has unmanifested files; it is too dangerous to allow unintentional content to pass through.

For .DS_Store and thumbs.db files, we currently make an exception and let them through, because they are just artifacts. But ideally those shouldn't be in the epub either, as they are only taking up space there for no good reason. I'd be willing to make those a warning.
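
The proposal in this comment (an error for unmanifested files, with operating-system artifacts treated more leniently) could look roughly like the sketch below. The name `proposed_severity` and the severity strings are hypothetical, and the case-insensitive comparison is an assumption added because Windows usually writes the file as `Thumbs.db`.

```python
import posixpath

# Artifacts named in the comment above, stored lowercase for comparison.
OS_ARTIFACTS = frozenset({".ds_store", "thumbs.db"})


def proposed_severity(entry):
    """Map an unmanifested zip entry to the severity proposed in the comment."""
    name = posixpath.basename(entry).lower()
    return "warning" if name in OS_ARTIFACTS else "error"


for entry in ("OEBPS/.DS_Store", "OEBPS/images/Thumbs.db", "OEBPS/ch1.html.bak"):
    print(entry, "->", proposed_severity(entry))
```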

@rdeltour
Member Author

From garthcon...@gmail.com on August 03, 2010 15:57:30

Dave, please comment (just for interest's sake) on who currently gets a pass. All files in META-INF?

@rdeltour
Member Author

From liza31337@gmail.com on August 03, 2010 15:59:42

Status: Started
Owner: liza31337

@rdeltour
Member Author

From dave...@gmail.com on August 03, 2010 17:09:24

In our system, we currently only give a pass to files in META-INF whose names are in the spec: container.xml, manifest.xml, metadata.xml, signatures.xml, encryption.xml, rights.xml.

There is also the known wart of our own system regarding the file iTunesMetadata.plist that we stick in the top-level of the zip file. We've made an allowance for that, right now, though that's clearly not consistent with our requirement that all files be in the manifest.

I think that if the spec were to explicitly allow someplace (META-INF? a specific subdirectory of META-INF?) in which arbitrary metadata could be stored, I might be able to make a case to put our iTunesMetadata.plist file in that location instead; and along with that, we could potentially allow other arbitrary metadata files to live in that location.
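
The allowances described in this comment amount to a small allow-list. A minimal sketch follows, assuming the file names exactly as listed above; `gets_a_pass` and the two constant names are hypothetical, and the top-level `mimetype` file, which the comment does not mention, is left out.

```python
import posixpath

# META-INF file names that the comment says are given a pass.
KNOWN_META_INF = frozenset({
    "container.xml", "manifest.xml", "metadata.xml",
    "signatures.xml", "encryption.xml", "rights.xml",
})

# Top-level allowance for the iTunesMetadata.plist file.
KNOWN_TOP_LEVEL = frozenset({"iTunesMetadata.plist"})


def gets_a_pass(entry):
    """Return True if an unmanifested zip entry is on the allow-list."""
    directory, name = posixpath.split(entry)
    if directory == "META-INF":
        return name in KNOWN_META_INF
    if directory == "":
        return name in KNOWN_TOP_LEVEL
    # Anything elsewhere, including subdirectories of META-INF, is not exempt.
    return False


print(gets_a_pass("META-INF/encryption.xml"))  # listed by name
print(gets_a_pass("META-INF/notes.txt"))       # arbitrary metadata, not listed
```

Allowing arbitrary metadata in META-INF, as the last paragraph suggests, would only require changing the first branch to return `True`.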

@rdeltour
Member Author

From dau...@gmail.com on August 03, 2010 17:38:17

Most retailers and distributors I've dealt with do not distinguish between warnings and errors in ePubCheck. Either one results in the file being rejected.

I'm good with unmanifested files being an error, with the exceptions noted by Dave M.

Dave C.

@rdeltour
Member Author

From garthcon...@gmail.com on August 03, 2010 20:39:48

I hate to be a stick in the mud, but this really isn't an error. However, I do think, as Dave and others have pointed out, this "feature" is generally, currently, used by accident, so a warning might be reasonable. If some workflows want to treat warnings with the same severity as they do errors, so be it.

But, for example, if an EPUB contains multiple renditions of a publication (multiple <rootfile> elements in the container.xml), the non-OPS renditions would, by definition, not be included in the OPF <manifest>. This is clearly not an error -- e.g., an EPUB contains both an OPS and a PDF rendition. Also, note that during the EPUB 2.1 effort, it is possible that multiple OPS renditions of a publication may become more standard, for encapsulating both platform-targeted and re-flowable versions of the publication -- each OPF manifest would contain only its members (not the union of all).

Also note, there currently exist EPUB workflows that encapsulate additional files. For example, Sony periodicals:

-- container.xml specifies two root files: one is the OPF and the other is atom.xml.
-- The atom.xml contains all metadata of articles/sections.
-- Since a package contains two different kinds of content, we have extended metadata.xml, and the package metadata is described in that file.

I have requested more information on the above, so we can add this to our thinking.

But, clearly it seems that the "known" files in META-INF need to be given a pass, as well as any alternate renditions referenced by "container.xml" (but that only works if those renditions are single-files, which is not required).

@rdeltour
Member Author

From liza31337@gmail.com on August 03, 2010 21:06:51

We're proceeding with generating a warning only here. I'm undecided on whether to whitelist the explicit list of known files in META-INF, or anything in that directory. Since the use of alternate renderings is rare, I think we'll start with generating a warning for all files not in either the container.xml or the OPF, and let people file tickets if that's insufficient (due to cases where the alternate renditions themselves reference other files).
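
The rule chosen here (warn about any file referenced by neither container.xml nor the OPF) can be sketched as follows. This is an illustration under assumptions, not the code from the fix: the function names are hypothetical, the manifest paths are assumed to be already resolved relative to the zip root, and the sample container.xml imitates the Sony layout described earlier in the thread. META-INF handling is omitted because the comment leaves it undecided.

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"


def rootfile_paths(container_xml):
    """Collect the full-path of every <rootfile>, whether it is an OPF or not."""
    root = ET.fromstring(container_xml)
    return {rf.get("full-path") for rf in root.iter(f"{CONTAINER_NS}rootfile")}


def entries_to_warn_about(entries, manifest_paths, container_xml):
    """Return entries referenced by neither container.xml nor the manifest."""
    referenced = set(manifest_paths) | rootfile_paths(container_xml)
    return sorted(set(entries) - referenced)


# Made-up container.xml with an OPF rendition and an Atom rootfile.
SAMPLE_CONTAINER = """\
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
    <rootfile full-path="atom.xml" media-type="application/atom+xml"/>
  </rootfiles>
</container>
"""

flagged = entries_to_warn_about(
    entries=["OEBPS/content.opf", "atom.xml", "OEBPS/ch1.xhtml", "OEBPS/ch1.html.bak"],
    manifest_paths=["OEBPS/ch1.xhtml"],
    container_xml=SAMPLE_CONTAINER,
)
print(flagged)
```

In this sample only the leftover `OEBPS/ch1.html.bak` is flagged; `atom.xml` passes because container.xml names it. Files that an alternate rendition itself references would still be flagged, which is the limitation the comment acknowledges.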

@rdeltour
Member Author

From garthcon...@gmail.com on August 03, 2010 21:16:06

And, at least the other explicitly allowed files in META-INF, I'd expect. :-)

@rdeltour
Member Author

From liza31337@gmail.com on August 04, 2010 12:12:58

Of course, that wasn't in contention here.

@rdeltour
Member Author

From garthcon...@gmail.com on August 09, 2010 21:05:42

I have some comments from Sony regarding their usage (it seems as though this usage is largely spec-compliant, so we should endeavor not to "invalidate" such usage) -- see below:

Attached are the samples. I think that is the best way to make the usage clear. We don't specify the name of the Atom file. Any file name is OK as long as it is written in container.xml.

The OPF file does not refer to the Atom file, because it is not a part of the OPF-based epub content. (It is really hard to describe...)

We could use RSS instead of Atom, but since the ePub file format favors namespace-based formats, we chose Atom.

Attachment: atom.xml container.xml metadata.xml

@rdeltour
Member Author

From liza31337@gmail.com on August 31, 2010 08:15:28

Issue 38 has been merged into this issue.

@rdeltour
Member Author

From liza31337@gmail.com on August 31, 2010 08:24:00

Labels: Priority-High

@rdeltour
Member Author

From liza31337@gmail.com on September 27, 2010 20:01:36

Fixed in rev. 135 (version 1.0.6-dev).

Includes three tests for the conditions listed above:

A testdocs/general/ContainerNotOPF.epub
A testdocs/general/MetaInfNotOPF.epub
A testdocs/general/Unmanifested.epub

Status: Fixed

@rdeltour
Member Author

From liza31337@gmail.com on October 12, 2010 17:01:55

Issue 79 has been merged into this issue.

rdeltour added a commit that referenced this issue Dec 22, 2022
EPUBCheck used to report any resource found in the container but not
listed in the manifest as a warning (since #58 was fixed, in v1.1). But
the EPUB specification does not require that.

This commit downgrades the severity of `OPF-003` to a usage report.

See also #1452
See also w3c/epub-specs#563