Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for extracting hard links #102

Merged
merged 1 commit into from
Apr 24, 2021
Merged

Support for extracting hard links #102

merged 1 commit into from
Apr 24, 2021

Conversation

StefanKarpinski
Copy link
Member

It turns out that implementing hard links is easy: you just copy the hard link target instead of creating the file directly. The tricky part is making sure that a malicious tarball doesn't do something dangerous with a combination of hard links and symlinks, and this can get very tricky indeed. Closes #101.

I'm also including a couple of unrelated helper changes here:

  • Tar.list with callback: allow passing raw header data — this is useful for analyzing wild tarballs
  • Header(hdr) constructor with keywords to modify hdr — I need to copy a header and this is convenient

@codecov
Copy link

codecov bot commented Apr 17, 2021

Codecov Report

Merging #102 (c4b28fb) into master (3708fea) will increase coverage by 0.17%.
The diff coverage is 97.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #102      +/-   ##
==========================================
+ Coverage   96.30%   96.47%   +0.17%     
==========================================
  Files           4        4              
  Lines         649      681      +32     
==========================================
+ Hits          625      657      +32     
  Misses         24       24              
Impacted Files Coverage Δ
src/Tar.jl 96.10% <ø> (ø)
src/extract.jl 96.61% <97.14%> (+0.23%) ⬆️
src/create.jl 97.19% <100.00%> (+0.09%) ⬆️
src/header.jl 94.44% <100.00%> (+0.24%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3708fea...c4b28fb. Read the comment docs.

@StefanKarpinski
Copy link
Member Author

For posterity, here's an untested cut at checking that a hardlink is safe in full generality:

diff --git a/src/extract.jl b/src/extract.jl
index 0d72af0..8593f8a 100644
--- a/src/extract.jl
+++ b/src/extract.jl
@@ -351,6 +351,60 @@ function read_tarball(
             path = isempty(path) ? part : "$path/$part"
         end
         paths[path] = hdr.type == :symlink ? hdr.link : hdr.type
+        # check that hard links refer to already-seen files
+        if hdr.type == :hardlink
+            link_parts = split(hdr.link, '/')
+            target_parts = String[]
+            while !isempty(link_parts)
+                part = popfirst!(link_parts)
+                (isempty(part) || part == ".") && continue
+                if part == ".."
+                    isempty(target) && error("""
+                    Refusing to extract hardlink with external target
+                     * path: $(repr(hdr.path))
+                     * hardlink: $(repr(hdr.link))
+                    """)
+                    pop!(target)
+                    continue
+                end
+                target = join(target_parts, '/')
+                type = get(paths, target, nothing)
+                type === nothing && error("""
+                Refusing to extract hardlink with non-existent target
+                 * path: $(repr(hdr.path))
+                 * hardlink: $(repr(hdr.link))
+                """)
+                if !isempty(link_parts)
+                    # prefix of final link target
+                    if type == :symlink
+                        link = paths[target]
+                        link[1] == '/' && error("""
+                        Refusing to extract hardlink with absolute symlink prefix
+                         * path: $(repr(hdr.path))
+                         * hardlink: $(repr(hdr.link))
+                        """)
+                        prepend!(link_parts, split(link, '/'))
+                    elseif type != :directory
+                        error("""
+                        Refusing to extract hardlink with $type prefix
+                         * path: $(repr(hdr.path))
+                         * hardlink: $(repr(hdr.link))
+                        """)
+                    end
+                else
+                    # final link target
+                    type == :file || error("""
+                    Refusing to extract hardlink to a $type
+                     * path: $(repr(hdr.path))
+                     * hardlink: $(repr(hdr.link))
+                    """)
+                    if hdr.link != target
+                        hdr = Header(hdr; link=target)
+                    end
+                end
+            end
+        end
+        # apply callback, checking that it consumes IO correctly
         before = applicable(position, tar) ? position(tar) : 0
         callback(hdr, split(path, '/', keepempty=false))
         applicable(position, tar) || continue

I've decided that this is too complex and not worth it. For a first cut, I will add support for extracting hard links but with the restriction that they can only refer to (already-extracted) file paths. I suspect this will be what actually occurs 99.99% of the time. What the above complex logic allows for is a hard link which refers to an already-extracted file via some already-extracted symlink prefix, which would work but makes it much harder to ensure that something bad isn't happening.

@StefanKarpinski StefanKarpinski force-pushed the sk/hardlinks branch 3 times, most recently from d7c1233 to a9140a8 Compare April 18, 2021 00:17
@maleadt
Copy link

maleadt commented Apr 19, 2021

Thanks, works great. The restriction doesn't pose a problem for the tarballs I'm using.

@StefanKarpinski
Copy link
Member Author

StefanKarpinski commented Apr 19, 2021

I would be surprised if there's any way to get command-line tar tools to create a tarball that doesn't have this property. I haven't looked at the GNU tar source but presumably it remembers the device & inode of each file it archives with hard link count larger than one, and if later while generating the same archive it encounters another file with the same device & inode, it can emit a hardlink to the earlier archive path instead. That guarantees that the hard link target is a plain file that appears in the tarball before the hard link to it.

@StefanKarpinski StefanKarpinski force-pushed the sk/hardlinks branch 6 times, most recently from f9bbb4e to 411e8cf Compare April 23, 2021 17:47
@StefanKarpinski
Copy link
Member Author

Finally, this works.

This adds support for hardlinks, including:

- extracting them by copying the linked file (no hardlink created)
- tree hashing them as they are extracted
- rewriting by duplicating the linked file

This only supports hardlinks whose target is a plain file that has
already been seen in the tarball that is being processed. You cannot
have a hardlink that appears before the file that is linked. If the
target of a hardlink is overwritten later, the link copies the current
version of the file at the time of extraction. Tree hashing and rewrite
are both consistent with this behavior. It is not supported to extract
hardlinks where the link involves symlinks, even if the link refers to a
path that would be a file — the target must be a plain file.

Close #101.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unsupported entry type: hardlink
2 participants