support for hardlinks: extract, tree_hash, rewrite (#102)
StefanKarpinski committed Apr 24, 2021
1 parent 3708fea commit b0ddf51
Showing 6 changed files with 143 additions and 35 deletions.
31 changes: 25 additions & 6 deletions README.md
@@ -85,6 +85,13 @@ is encountered while extracting `tarball` and the entry is only extracted if the
an archive, to skip entries that cause `extract` to throw an error, or to record
what is extracted during the extraction process.

Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
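
For example, a predicate can use the normalized fields directly. The following
is a usage sketch (the tarball and directory names are hypothetical):

```jl
using Tar

# Extract everything except hardlink entries; `hdr` is the normalized Header.
Tar.extract("data.tar", "output") do hdr
    hdr.type != :hardlink
end
```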

If the `skeleton` keyword is passed then a "skeleton" of the extracted tarball
is written to the file or IO handle given. This skeleton file can be used to
recreate an identical tarball by passing the `skeleton` keyword to the `create`
@@ -156,6 +163,13 @@ is encountered while extracting `old_tarball` and the entry is skipped unless
an archive, to skip entries that would cause `extract` to throw an error, or to
record what content is encountered during the rewrite process.

Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
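
As a usage sketch (with hypothetical file names), a predicate can drop unwanted
entries while rewriting an archive:

```jl
using Tar

# Copy `old.tar` to `new.tar`, dropping entries under a top-level "tmp/" directory.
Tar.rewrite("old.tar", "new.tar") do hdr
    !startswith(hdr.path, "tmp/")
end
```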

### Tar.tree_hash

@@ -187,6 +201,13 @@ is encountered while processing `tarball` and an entry is only hashed if
archive, to skip entries that cause `extract` to throw an error, or to record
what is extracted during the hashing process.

Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
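
As a usage sketch (with a hypothetical file name), the predicate sees the same
normalized headers while hashing:

```jl
using Tar

# Hash the tree while ignoring log files; returns the hash as a hex string.
hash = Tar.tree_hash("package.tar") do hdr
    !endswith(hdr.path, ".log")
end
```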

Currently supported values for `algorithm` are `git-sha1` (the default) and
`git-sha256`, which uses the same basic algorithm as `git-sha1` but replaces the
SHA1 hash function with SHA2-256, the hash function that git will transition to
@@ -362,18 +383,16 @@ supports only the following file types:
* plain files
* directories
* symlinks
* hardlinks (extracted as copies)

The `Tar` package does not support other file types that the TAR format can
-represent, including: hard links, character devices, block devices, and FIFOs.
-If you attempt to create or extract an archive that contains any of these kinds
-of entries, `Tar` will raise an error. You can, however, list the contents of a
represent, including: character devices, block devices, and FIFOs. If you
attempt to create or extract an archive that contains any of these kinds of
entries, `Tar` will raise an error. You can, however, list the contents of a
tarball containing other kinds of entries by passing the `strict=false` flag to
the `list` function; without this option, `list` raises the same error as
`extract` would.
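
As a usage sketch (with a hypothetical file name):

```jl
using Tar

# List entries without rejecting unsupported types such as FIFOs or devices.
for hdr in Tar.list("mixed.tar"; strict=false)
    println(hdr.type, "\t", hdr.path)
end
```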

-In the future, optional support may be added for using hard links within
-archives to avoid duplicating identical files.

### Time Stamps

Also in accordance with its design goal as a data transfer tool, the `Tar`
21 changes: 21 additions & 0 deletions src/Tar.jl
@@ -175,6 +175,13 @@ is encountered while extracting `tarball` and the entry is only extracted if the
an archive, to skip entries that cause `extract` to throw an error, or to record
what is extracted during the extraction process.
Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
If the `skeleton` keyword is passed then a "skeleton" of the extracted tarball
is written to the file or IO handle given. This skeleton file can be used to
recreate an identical tarball by passing the `skeleton` keyword to the `create`
@@ -251,6 +258,13 @@ is encountered while extracting `old_tarball` and the entry is skipped unless
`predicate(hdr)` is true. This can be used to selectively rewrite only parts of
an archive, to skip entries that would cause `extract` to throw an error, or to
record what content is encountered during the rewrite process.
Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
"""
function rewrite(
predicate::Function,
@@ -301,6 +315,13 @@ is encountered while processing `tarball` and an entry is only hashed if
archive, to skip entries that cause `extract` to throw an error, or to record
what is extracted during the hashing process.
Before it is passed to the predicate function, the `Header` object is somewhat
modified from the raw header in the tarball: the `path` field is normalized to
remove `.` entries and replace multiple consecutive slashes with a single slash.
If the entry has type `:hardlink`, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
Currently supported values for `algorithm` are `git-sha1` (the default) and
`git-sha256`, which uses the same basic algorithm as `git-sha1` but replaces the
SHA1 hash function with SHA2-256, the hash function that git will transition to
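
The docstring additions above mirror the README. As an illustration (with
hypothetical field values), a predicate for a hardlink entry receives a
normalized `Header` whose `link` matches the target's normalized path and whose
`size` is the size of the target file:

```jl
using Tar

# Fields are (path, type, mode, size, link); the values here are made up.
hdr = Tar.Header("docs/copy.md", :hardlink, 0o644, 1234, "docs/orig.md")
hdr.size   # 1234, taken from the already-seen target file
```
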
15 changes: 12 additions & 3 deletions src/create.jl
@@ -54,10 +54,19 @@ function rewrite_tarball(
end
node = node′
end
if !(hdr.type == :directory && get(node, name, nothing) isa Dict)
node[name] = (hdr, position(old_tar))
if hdr.type == :hardlink
node′ = tree
for part in split(hdr.link, '/')
node′ = node′[part]
end
hdr′ = Header(node′[1], path=hdr.path, mode=hdr.mode)
node[name] = (hdr′, node′[2])
else
if !(hdr.type == :directory && get(node, name, nothing) isa Dict)
node[name] = (hdr, position(old_tar))
end
skip_data(old_tar, hdr.size)
end
skip_data(old_tar, hdr.size)
end
write_tarball(new_tar, tree, buf=buf) do node, tar_path
if node isa Dict
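
The hardlink branch above resolves the link target in the nested `tree`
dictionary of already-seen entries. A toy sketch of that lookup (illustrative
only, not the package's code):

```jl
# Files map to (header, data offset) pairs; directories map to nested Dicts.
tree = Dict("docs" => Dict("orig.md" => (:header, 512)))

function resolve_link(tree::Dict, link::AbstractString)
    node = tree
    for part in split(link, '/')
        node = node[part]   # descend one normalized path component at a time
    end
    return node             # (header, offset) of the already-seen target file
end

resolve_link(tree, "docs/orig.md")   # => (:header, 512)
```
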
87 changes: 65 additions & 22 deletions src/extract.jl
@@ -79,8 +79,16 @@ function extract_tarball(
mkdir(sys_path)
elseif hdr.type == :symlink
copy_symlinks || symlink(hdr.link, sys_path)
elseif hdr.type == :hardlink
src_path = joinpath(root, hdr.link)
cp(src_path, sys_path)
elseif hdr.type == :file
read_data(tar, sys_path, size=hdr.size, buf=buf)
else # should already be caught by check_header
error("unsupported tarball entry type: $(hdr.type)")
end
# apply tarball permissions
if hdr.type in (:file, :hardlink)
exec = 0o100 & hdr.mode != 0
tar_mode = exec ? 0o755 : 0o644
sys_mode = filemode(sys_path)
@@ -93,21 +101,19 @@
# we don't have a way to do that afaik
end
chmod(sys_path, tar_mode & sys_mode)
else # should already be caught by check_header
error("unsupported tarball entry type: $(hdr.type)")
end
end
copy_symlinks || return

# resolve the internal targets of symlinks
for (path, what) in paths
what isa AbstractString || continue
what isa String || continue
target = link_target(paths, path, what)
paths[path] = something(target, :symlink)
end

# follow chains of symlinks
follow(seen::Vector, what::Symbol) =
follow(seen::Vector, what::Any) =
what == :symlink ? what : seen[end]
follow(seen::Vector, what::String) =
what in seen ? :symlink : follow(push!(seen, what), paths[what])
@@ -159,7 +165,7 @@ end

# resolve symlink target or nothing if not valid
function link_target(
paths::Dict{String,Union{String,Symbol}},
paths::Dict{String},
path::AbstractString,
link::AbstractString,
)
@@ -220,12 +226,18 @@ function git_tree_hash(
node[name] = Dict{String,Any}()
end
return
end
if hdr.type == :symlink
elseif hdr.type == :symlink
mode = "120000"
hash = git_object_hash("blob", HashType) do io
write(io, hdr.link)
end
elseif hdr.type == :hardlink
mode = iszero(hdr.mode & 0o100) ? "100644" : "100755"
node′ = tree
for part in split(hdr.link, '/')
node′ = node′[part]
end
hash = node′[2] # hash of linked file
elseif hdr.type == :file
mode = iszero(hdr.mode & 0o100) ? "100644" : "100755"
hash = git_file_hash(tar, hdr.size, HashType, buf=buf)
@@ -332,31 +344,62 @@ function read_tarball(
)
write_skeleton_header(skeleton, buf=buf)
# symbols for path types except symlinks store the link
paths = Dict{String,Union{Symbol,String}}()
paths = Dict{String,Any}()
globals = Dict{String,String}()
while !eof(tar)
hdr = read_header(tar, globals=globals, buf=buf, tee=skeleton)
hdr === nothing && break
# check if we should extract or skip
if !predicate(hdr)
skip_data(tar, hdr.size)
continue
end
check_header(hdr)
err = nothing
# normalize path and check for symlink attacks
path = ""
for part in split(hdr.path, '/')
# check_header checks for ".." later
(isempty(part) || part == ".") && continue
# check_header doesn't allow ".." in path
get(paths, path, nothing) isa String && error("""
Refusing to extract path with symlink prefix, possible attack
* path to extract: $(repr(hdr.path))
* symlink prefix: $(repr(path))
""")
isempty(path) || (paths[path] = :directory)
if err === nothing && get(paths, path, nothing) isa String
err = """
Tarball contains path with symlink prefix:
- path = $(repr(hdr.path))
- prefix = $(repr(path))
Refusing to extract — possible attack!
"""
end
path = isempty(path) ? part : "$path/$part"
end
paths[path] = hdr.type == :symlink ? hdr.link : hdr.type
hdr′ = Header(hdr, path=path)
# check that hardlinks refer to already-seen files
if err === nothing && hdr.type == :hardlink
parts = filter!(split(hdr.link, '/')) do part
# check_header checks for ".." later
!isempty(part) && part != "."
end
link = join(parts, '/')
hdr = Header(hdr, link=link)
hdr′ = Header(hdr′, link=link)
what = get(paths, link, Symbol("non-existent"))
if what isa Integer # plain file
hdr′ = Header(hdr′, size=what)
else
err = """
Tarball contains hardlink with $what target:
- path = $(repr(hdr.path))
- target = $(repr(hdr.link))
Refusing to extract — possible attack!
"""
end
end
# check if we should extract or skip
if !predicate(hdr′) # pass normalized header
skip_data(tar, hdr.size)
continue
end
check_header(hdr)
err === nothing || error(err)
# record info about path
paths[path] =
hdr.type == :symlink ? hdr.link :
hdr.type == :file ? hdr.size :
hdr.type
# apply callback, checking that it consumes IO correctly
before = applicable(position, tar) ? position(tar) : 0
callback(hdr, split(path, '/', keepempty=false))
applicable(position, tar) || continue
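
The loop above normalizes each entry path before it reaches the predicate and
the symlink/hardlink checks. A standalone sketch of that normalization rule
(illustrative only, not the package's internal code):

```jl
# Drop empty and "." components, collapsing repeated slashes;
# ".." components are rejected separately by check_header.
function normalize_tar_path(p::AbstractString)
    parts = filter(split(p, '/')) do part
        !isempty(part) && part != "."
    end
    return join(parts, '/')
end

normalize_tar_path("./a//b/./c")   # => "a/b/c"
```
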
12 changes: 9 additions & 3 deletions src/header.jl
@@ -99,12 +99,18 @@ function check_header(hdr::Header)
err("path is absolute")
occursin(r"(^|/)\.\.(/|$)", hdr.path) &&
err("path contains '..' component")
hdr.type in (:file, :symlink, :directory) ||
hdr.type in (:file, :hardlink, :symlink, :directory) ||
err("unsupported entry type")
hdr.type ∉ (:hardlink, :symlink) && !isempty(hdr.link) &&
err("non-link with link path")
hdr.type == :symlink && hdr.size != 0 &&
err("symlink with non-zero size")
hdr.type ∈ (:hardlink, :symlink) && isempty(hdr.link) &&
err("$(hdr.type) with empty link path")
hdr.type ∈ (:hardlink, :symlink) && hdr.size != 0 &&
err("$(hdr.type) with non-zero size")
hdr.type == :hardlink && hdr.link[1] == '/' &&
err("hardlink with absolute link path")
hdr.type == :hardlink && occursin(r"(^|/)\.\.(/|$)", hdr.link) &&
err("hardlink contains '..' component")
hdr.type == :directory && hdr.size != 0 &&
err("directory with non-zero size")
hdr.type != :directory && endswith(hdr.path, "/") &&
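
As an illustration of the new hardlink checks (the header values are made up,
and `check_header` is an internal helper, not part of the public API):

```jl
using Tar

# A hardlink whose link path is absolute should now be rejected.
bad = Tar.Header("etc/copy", :hardlink, 0o644, 0, "/etc/passwd")
Tar.check_header(bad)   # throws: "hardlink with absolute link path"
```
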
12 changes: 11 additions & 1 deletion test/setup.jl
@@ -61,15 +61,25 @@ function make_test_tarball(tar_create::Function = Tar.create)
dir′ = joinpath(dir, "s"^b)
mkpath(dir′)
push!(paths, dir′)
path = paths[i += 1]
link = joinpath(dir, "l"^b)
target = relpath(paths[i += 1], link)
target = relpath(path, link)
symlink(target, link)
push!(paths, link)
broken = joinpath(dir, "b"^b)
if target != "."
symlink(chop(target), broken)
push!(paths, broken)
end
isfile(path) || continue
hard = joinpath(dir, "h"^b)
mode = isodd(i) ? 0o755 : 0o644
if Sys.which("ln") !== nothing
run(`ln $path $hard`)
else
cp(path, hard)
end
chmod(hard, mode)
end
end
end
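
A side note on the test helper above: it shells out to `ln` (falling back to a
plain copy) because the Julia versions targeted here have no built-in way to
create hardlinks. On Julia 1.8 and newer, `hardlink` could be used instead (a
sketch, not part of this commit):

```jl
path = tempname()
write(path, "data")
hard = path * "-hard"
hardlink(path, hard)   # creates a hardlink to `path`; requires Julia 1.8+
```
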
