
Massive RAM for moderate problem #254

Closed
cossio opened this issue Dec 14, 2018 · 36 comments · Fixed by #618

Comments

@cossio

cossio commented Dec 14, 2018

I have an example of a moderately sized problem where problem formulation consumes massive amounts of memory. I have no idea why this happens. Is there something inherently difficult about this kind of problem, or is this a bug in Convex.jl?

Note that this is not an issue with any solver. The RAM spike occurs at the problem formulation step, the line problem = maximize(...) below.

To run this code, you need to load the variables N and Adj from the .jld2 file (using JLD2.jl) here:

https://github.com/cossio/cvx_example_data

N is a 3-dimensional Float64 array, and Adj is a sparse matrix.

WARNING: The following code consumed so much memory that I had to reboot my laptop. Run it in a contained environment where you can cap memory consumption. (For example, using https://github.com/pshved/timeout, start Julia with timeout -m 4000000 julia to limit memory usage to 4 GB.)

x = Variable(2730)
S, V, T = size(N)
lN = log.(N)
M = vec(sum(N[:,:,2:T]; dims=(2,3)))
H = Adj * x
problem = maximize(-sum(logsumexp(lN[:,v,t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])
solve!(problem, ECOSSolver())

UPDATE: I'm now getting the RAM spike at the solve phase, instead of the formulation phase. Both ECOS and SCS have the same problem.

@cossio
Author

cossio commented Dec 14, 2018

@ararslan I'm now seeing the memory spike in the solve phase instead of the formulation. I'll close this unless something similar comes up again. I'll open an issue at ECOS or SCS.

@cossio cossio closed this as completed Dec 14, 2018
@ararslan
Contributor

Okay, thanks for reporting back. Did you see the spike only with a particular solver? If so, it would be good to know which.

@cossio
Author

cossio commented Dec 14, 2018

@ararslan I just tested with ECOS and SCS; both consume > 4 GB of RAM (at which point I cancelled to avoid having to reboot my laptop again).

@ararslan
Contributor

If it's both of them, that suggests to me that the issue may still be Convex's fault. Have you by chance done any profiling with the Profile stdlib package?

@cossio
Author

cossio commented Dec 14, 2018

@ararslan The problem is that my laptop dies before I can complete the profiling. Any suggestions?

@ararslan
Contributor

Oh, that makes sense. Hm, I don't know what to suggest offhand... I wonder if the work could be put into a Task, and then, after a certain amount of time, an interrupt could be sent to the task. Not sure if that would work though.
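
A minimal sketch of that idea, assuming the workload yields to the scheduler (the helper name `run_with_timeout` is made up for illustration, and a tight non-yielding loop may not be interruptible this way):

```julia
# Hypothetical sketch: run the work in a Task and interrupt it after a deadline,
# so the process survives long enough to inspect partial profiling data.
function run_with_timeout(f, seconds)
    task = @async f()
    timer = Timer(seconds) do _
        # If the task is still running when the timer fires, interrupt it.
        istaskdone(task) || schedule(task, InterruptException(); error=true)
    end
    try
        return fetch(task)   # result of f(), or rethrows its exception
    finally
        close(timer)
    end
end
```

Usage would be something like `run_with_timeout(() -> Convex.conic_problem(problem), 60)`, though it is untested against this issue.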

@cossio
Author

cossio commented Dec 15, 2018

I just tried limiting memory to 10 GB on a server and it also crashes, with both SCS and ECOS.
I don't know how to do the Task thing, and I'm really out of ideas for diagnosing what's going on.

In case it's really an issue with Convex I'll reopen this. Maybe someone else thinks of something.
But feel free to close it if you think so.

@cossio cossio reopened this Dec 15, 2018
@cossio cossio changed the title Massive RAM for problem formulation Massive RAM for moderate problem Dec 15, 2018
@cossio
Author

cossio commented Dec 15, 2018

@ararslan I started Julia with a memory constraint of 10 GB using:

timeout -m 10000000 julia --track-allocation=user

This tracked memory allocations across all of Convex.jl as the RAM usage increased, until it was killed at 10GB. The .mem files are at https://github.com/cossio/cvx_example_data/blob/master/convex_allocations.tar.gz.

I ran this example with SCS. The .mem files of SCS.jl are at https://github.com/cossio/cvx_example_data/blob/master/SCS_allocation_analysis.tar.gz

Let me know if that helps, or if I can do anything else.

BTW, is there a utility to explore a directory tree like this, with many .mem files, to find the highest allocations?

@ararslan
Contributor

BTW, is there a utility to explore a directory tree like this, with many .mem files, to find the highest allocations?

Coverage can do this, actually.

julia> using Coverage

julia> mem = analyze_malloc(".");

julia> mem[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
 Coverage.MallocInfo(9785470, "./variable.jl.mem", 22)
 Coverage.MallocInfo(10224350, "./constant.jl.mem", 45)
 Coverage.MallocInfo(14402371, "./constant.jl.mem", 54)
 Coverage.MallocInfo(15131666, "./atoms/affine/add_subtract.jl.mem", 124)
 Coverage.MallocInfo(22798259, "./atoms/affine/add_subtract.jl.mem", 87)
 Coverage.MallocInfo(23872808, "./utilities/show.jl.mem", 20)
 Coverage.MallocInfo(31561657, "./constant.jl.mem", 22)
 Coverage.MallocInfo(47125096, "./expressions.jl.mem", 65)
 Coverage.MallocInfo(120645411, "./conic_form.jl.mem", 112)
 Coverage.MallocInfo(145535180, "./expressions.jl.mem", 106)
 Coverage.MallocInfo(916530186, "./solution.jl.mem", 12)

Looks like the source of the allocations in solution.jl is:

        - function solve!(problem::Problem,
        -                 s::MathProgBase.AbstractMathProgSolver;
        -                 kwargs...)
        -     # TODO: warn about wiping out old model if warmstart=true?
916530186     problem.model = MathProgBase.ConicModel(s)
        0     return solve!(problem; kwargs...)
        - end

@cossio
Author

cossio commented Dec 17, 2018

@ararslan Thanks for the suggestion, Coverage helps.

I inspected the .mem files for MathProgBase (https://github.com/cossio/cvx_example_data/blob/master/mpb_allocation.tar.gz) and they show no allocations. Then I realized the solvers implement their own versions of ConicModel(solver).

In ECOS.jl, this line stands out. It seems this is an issue with the solvers after all.

                - function ver()
100391444         ver_ptr = ccall((:ECOS_ver, ECOS.ecos), Ptr{UInt8}, ())
                - return unsafe_string(ver_ptr)
                - end

It is interesting, though: this function just returns the ECOS version?

@ararslan
Copy link
Contributor

Hm, okay, that's bizarre. You said you encountered the same problem with SCS? Do you have memory allocations for that?

@cossio
Author

cossio commented Dec 17, 2018

@ararslan Yes, https://github.com/cossio/cvx_example_data/blob/master/scs_allocation.tar.gz.

julia> mem[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
 Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 387)      
 Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 388)      
 Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 389)      
 Coverage.MallocInfo(0, "./MPBWrapper.jl.mem", 397)      
 Coverage.MallocInfo(0, "./SCS.jl.mem", 22)              
 Coverage.MallocInfo(0, "./SCS.jl.mem", 23)              
 Coverage.MallocInfo(1248, "./MPBWrapper.jl.mem", 36)    
 Coverage.MallocInfo(1440, "./SCS.jl.mem", 21)           
 Coverage.MallocInfo(129136, "./MPBWrapper.jl.mem", 31)  
 Coverage.MallocInfo(4402990, "./SCS.jl.mem", 20)        
 Coverage.MallocInfo(79146051, "./c_wrapper.jl.mem", 124)

Again a version call:

        - function SCS_version()
 79146051     return unsafe_string(ccall((:scs_version, SCS.direct), Cstring, ()))
        - end

Does julia --track-allocation count allocations made by ccalls?

@ararslan
Contributor

I'm not sure, actually, though it would appear so. Are these version checks actually called somewhere during the solving process?

@cossio
Author

cossio commented Dec 17, 2018

@ararslan I think these solvers print a version banner before starting the iterations. But I'm not sure if that's what we are seeing.

@ararslan
Contributor

If I'm not mistaken, printing banners is done by the C code itself rather than by the Julia wrappers, so I don't think that would be reflected in the allocations Julia tracks for the version-querying function.

@mlubin
Member

mlubin commented Dec 18, 2018

If you're trying to isolate Convex.jl's processing from the solver, you can call conic_problem to generate the input without solving, e.g., https://github.com/JuliaOpt/ConicBenchmarkUtilities.jl/blob/4267bfedeace99fcc4a03256eef8d44151f38b95/src/convex_to_cbf.jl#L4.

@cossio
Author

cossio commented Dec 18, 2018

Thanks @mlubin! I'm seeing the allocations again, just from the call to conic_problem. So we are back to this not being an issue with the solvers. Here is what I got.

timeout.pl -m 10000000 julia --track-allocation=user
julia> include("cvx_issue.jl")
julia> con = Convex.conic_problem(problem)   # huge memory alloc

where cvx_issue.jl is this file:

using Convex, FileIO, JLD2, ECOS, SparseArrays, LinearAlgebra
@load "cvx_example.jld2"
x = Variable(2730)
S, V, T = size(N)
lN = log.(N)
M = vec(sum(N[:,:,2:T]; dims=(2,3)))
H = Adj * x
problem = maximize(-sum(logsumexp(lN[:,v,t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])

and cvx_example.jld2 is here: https://github.com/cossio/cvx_example_data.

Inspecting the resulting .mem files from all my packages (using Coverage) gives this:

julia> analyze_malloc(".")[end-20:end]
21-element Array{Coverage.MallocInfo,1}:
 Coverage.MallocInfo(13571105, "./JLD2/KjBIK/src/groups.jl.mem", 92)                      
 Coverage.MallocInfo(15278217, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 124)
 Coverage.MallocInfo(15407963, "./Convex/MLj0c/src/constant.jl.mem", 54)                  
 Coverage.MallocInfo(15721748, "./Convex/MLj0c/src/constant.jl.mem", 22)                  
 Coverage.MallocInfo(15917995, "./JLD2/KjBIK/src/JLD2.jl.mem", 334)                       
 Coverage.MallocInfo(19283630, "./JLD2/KjBIK/src/loadsave.jl.mem", 2)                     
 Coverage.MallocInfo(20910662, "./JLD2/KjBIK/src/JLD2.jl.mem", 352)                       
 Coverage.MallocInfo(22778265, "./JLD2/KjBIK/src/data.jl.mem", 70)                        
 Coverage.MallocInfo(24140413, "./Convex/MLj0c/src/constraints/constraints.jl.mem", 143)  
 Coverage.MallocInfo(26433226, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 87) 
 Coverage.MallocInfo(43264136, "./Compat/HVYNa/src/Compat.jl.mem", 310)                   
 Coverage.MallocInfo(46994104, "./Convex/MLj0c/src/expressions.jl.mem", 65)               
 Coverage.MallocInfo(61731328, "./JLD2/KjBIK/src/data.jl.mem", 252)                       
 Coverage.MallocInfo(79002811, "./ECOS/e0Lr6/src/ECOS.jl.mem", 27)                        
 Coverage.MallocInfo(88962353, "./JLD2/KjBIK/src/JLD2.jl.mem", 399)                       
 Coverage.MallocInfo(95562654, "./JLD2/KjBIK/src/loadsave.jl.mem", 57)                    
 Coverage.MallocInfo(107061059, "./JLD2/KjBIK/src/JLD2.jl.mem", 288)                      
 Coverage.MallocInfo(133353998, "./Convex/MLj0c/src/conic_form.jl.mem", 112)              
 Coverage.MallocInfo(151016207, "./Convex/MLj0c/src/expressions.jl.mem", 106)             
 Coverage.MallocInfo(301982888, "./Convex/MLj0c/src/expressions.jl.mem", 120)             
 Coverage.MallocInfo(351788459, "./JLD2/KjBIK/src/data.jl.mem", 7)                        

I ignore JLD2 since that's just loading the data (I presume). Here are the most interesting lines from expressions.jl:

        - ### User-defined Unions
        - const Value = Union{Number, AbstractArray}
        - const ValueOrNothing = Union{Value, Nothing}
        - const AbstractExprOrValue = Union{AbstractExpr, Value}
        -
151016207 convert(::Type{AbstractExpr}, x::Value) = Constant(x)
        - convert(::Type{AbstractExpr}, x::AbstractExpr) = x
        -
        - function size(x::AbstractExpr, dim::Integer)
        -     if dim < 1
        -         error("dimension out of range")
        -     elseif dim > 2
        -         return 1
        -     else
        -         return size(x)[dim]
        -     end
        - end
        -
        - ndims(x::AbstractExpr) = 2
301982888 get_vectorized_size(x::AbstractExpr) = reduce(*, size(x))
        - lastindex(x::AbstractExpr) = get_vectorized_size(x)

and from conic_form.jl

        - function has_conic_form(conic_forms::UniqueConicForms, exp::AbstractExpr)
133353998     return haskey(conic_forms.exp_map, (exp.head, exp.id_hash))
        - end

I'm not familiar enough with this code to figure out what the issue might be. @ararslan any ideas?

@cossio
Author

cossio commented Dec 18, 2018

The top allocations just from Convex.jl:

julia> analyze_malloc("./Convex/")[end-10:end]
11-element Array{Coverage.MallocInfo,1}:
 Coverage.MallocInfo(9620766, "./Convex/MLj0c/src/variable.jl.mem", 22)                   
 Coverage.MallocInfo(10386046, "./Convex/MLj0c/src/utilities/show.jl.mem", 20)            
 Coverage.MallocInfo(15278217, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 124)
 Coverage.MallocInfo(15407963, "./Convex/MLj0c/src/constant.jl.mem", 54)                  
 Coverage.MallocInfo(15721748, "./Convex/MLj0c/src/constant.jl.mem", 22)                  
 Coverage.MallocInfo(24140413, "./Convex/MLj0c/src/constraints/constraints.jl.mem", 143)  
 Coverage.MallocInfo(26433226, "./Convex/MLj0c/src/atoms/affine/add_subtract.jl.mem", 87) 
 Coverage.MallocInfo(46994104, "./Convex/MLj0c/src/expressions.jl.mem", 65)               
 Coverage.MallocInfo(133353998, "./Convex/MLj0c/src/conic_form.jl.mem", 112)              
 Coverage.MallocInfo(151016207, "./Convex/MLj0c/src/expressions.jl.mem", 106)             
 Coverage.MallocInfo(301982888, "./Convex/MLj0c/src/expressions.jl.mem", 120)             

@ararslan
Contributor

So this is interesting:

        - function has_conic_form(conic_forms::UniqueConicForms, exp::AbstractExpr)
133353998     return haskey(conic_forms.exp_map, (exp.head, exp.id_hash))
        - end

Here conic_forms.exp_map is an

OrderedDict{Tuple{Symbol,UInt64},
            OrderedDict{UInt64,Tuple{Union{AbstractArray,Number},
                                     Union{AbstractArray,Number}}}}

I would not be surprised if haskey on this type is horrendously inefficient.

@cossio
Author

cossio commented Dec 18, 2018

@ararslan The fact that the value type of the OrderedDict is not concrete (Union{AbstractArray,Number}) may not be ideal. Can this type be made more specific? (Just throwing a random idea out there.)

@ararslan
Contributor

ararslan commented Dec 18, 2018

Unfortunately not without a huge undertaking to restructure the way Convex uses and expresses types. For reference, Value in Convex is actually Union{AbstractArray,Number}. (As an aside, because of this, Convex actually overwrites a lot of Base methods, since it defines methods on Value...)

@iamed2
Contributor

iamed2 commented Dec 19, 2018

I would not be surprised if haskey on this type is horrendously inefficient.

I can imagine it being slow but it absolutely should not be allocating much memory. I recommend benchmarking the functions identified here to check if they are actually problems.

Remember this is total memory allocated, so functions that are called a lot will report higher numbers.

These are also bytes allocated, so 301982888 (the highest number) is ~300 MB, which is not massive.

I'm suspicious that compilation memory usage may be in play.
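
A sketch of that benchmarking suggestion, assuming the pre-MOI Convex internals discussed above (`UniqueConicForms`, `conic_form!`, `has_conic_form`); the exact constructors and signatures may differ by Convex version:

```julia
# Benchmark has_conic_form in isolation, after a warm-up call, so JIT
# compilation does not inflate the numbers the way --track-allocation
# totals can.
using BenchmarkTools, Convex

x = Variable(2730)
expr = sum(x)
forms = Convex.UniqueConicForms()
Convex.conic_form!(expr, forms)   # populate the cache and force compilation

@benchmark Convex.has_conic_form($forms, $expr)
```

If `has_conic_form` reports near-zero allocations here, that would support the view that the `.mem` numbers for it were dominated by compilation.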

@cossio
Author

cossio commented Dec 20, 2018

@iamed2 It seemed strange to me that my system reported more than 10 GB of allocations while the .mem files reported lower values. So does julia --track-allocation=user miss all the allocations made by the JIT compiler? If so, what can I do to measure compiler allocations?

@iamed2
Contributor

iamed2 commented Dec 20, 2018

I'm saying the opposite: JIT allocations are included, so you can't draw conclusions about the code in a function from its allocation numbers unless you filter them out.

From the docs:

In interpreting the results, there are a few important details. Under the user setting, the first line of any function directly called from the REPL will exhibit allocation due to events that happen in the REPL code itself. More significantly, JIT-compilation also adds to allocation counts, because much of Julia's compiler is written in Julia (and compilation usually requires memory allocation). The recommended procedure is to force compilation by executing all the commands you want to analyze, then call Profile.clear_malloc_data() to reset all allocation counters. Finally, execute the desired commands and quit Julia to trigger the generation of the .mem files.
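
Concretely, the procedure from those docs might look like this for the example in this issue (a sketch only; the first run is exactly the one that blows up here, so it would still need a memory cap to survive):

```julia
# Run inside `timeout -m 10000000 julia --track-allocation=user`:
using Profile

include("cvx_issue.jl")         # defines `problem` (and triggers compilation of its deps)

Convex.conic_problem(problem)   # first run: includes JIT-compilation allocations
Profile.clear_malloc_data()     # reset the allocation counters
Convex.conic_problem(problem)   # second run: only runtime allocations are counted
exit()                          # the .mem files are written when Julia exits
```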

@cossio
Author

cossio commented Dec 20, 2018

I see, I misunderstood then. But that doesn't explain why the system reports > 10 GB of allocations.

@ararslan
Contributor

I ran the code with @profile and interrupted it after a bit, then leafed through the results. Anecdotally it's spending a fair amount of time on sparse matrix operations, but I'm not sure that's a primary offender.

@cossio
Author

cossio commented Dec 20, 2018

This bug protects itself: it kills your computer before showing itself. I tried Adj = Matrix(Adj) (which should avoid sparse operations), but got the same huge allocations.

@ararslan
Contributor

ararslan commented Dec 20, 2018

I tried that as well, but I'll also note that Convex uses sparse matrices in a number of places regardless of whether any particular input is sparse:

$ sift -e sparse src 
src/variable.jl:77:    return sparse(1.0I, vec_size, vec_size)
src/variable.jl:83:        return im*sparse(1.0I, vec_size, vec_size)
src/conic_form.jl:13:# we store the affine functions in this form for efficient manipulation of sparse affine functions
src/atoms/affine/diagm.jl:64:        coeff = sparse(I, J, 1.0, sz * sz, sz)
src/atoms/affine/transpose.jl:63:        transpose_matrix = sparse(I, J, 1.0)
src/atoms/affine/transpose.jl:126:        transpose_matrix = sparse(I, J, 1.0)
src/atoms/affine/partialtrace.jl:47:            a = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:48:            b = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:58:                    a = kron(a, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:59:                    b = kron(b, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:90:            a = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:91:            b = sparse(1.0I, 1, 1)
src/atoms/affine/partialtrace.jl:101:                    a = kron(a, sparse(1.0I, dim, dim))
src/atoms/affine/partialtrace.jl:102:                    b = kron(b, sparse(1.0I, dim, dim))
src/atoms/affine/index.jl:68:            index_matrix = sparse(1:sz, J, 1.0, m, n)
src/atoms/affine/index.jl:70:            index_matrix = sparse(1:length(x.inds), x.inds, 1.0, m, n)
src/atoms/affine/multiply_divide.jl:85:            objective = kron(sparse(1.0I, x.size[2], x.size[2]), x.children[1].value) * objective
src/atoms/affine/multiply_divide.jl:89:            objective = kron(x.children[2].value', sparse(1.0I, x.size[1], x.size[1])) * objective
src/atoms/sdp_cone/operatornorm.jl:73:        p = minimize(t, [t*sparse(1.0I, m, m) A; A' t*sparse(1.0I, n, n)] ⪰ 0)

@ararslan
Contributor

ararslan commented Dec 21, 2018

I have a local branch of Convex that uses FillArrays and LazyArrays in place of SparseArrays where it makes sense to do so. Running the example in this issue, I'm seeing a peak memory usage of about 15%, which is down from 98% or so using SparseArrays alone.

Edit: It's still running after 12 minutes, and the memory usage has increased to 16%.
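
For reference, the kind of substitution involved (a sketch of the general idea; whether each Convex call site can accept a lazy array depends on the surrounding code):

```julia
# FillArrays.Eye is a lazy identity with O(1) storage, versus the n stored
# values plus index arrays of the sparse(1.0I, n, n) pattern used in Convex.
using FillArrays, LinearAlgebra, SparseArrays

n = 2730
sparse_id = sparse(1.0I, n, n)   # materializes stored values and indices
lazy_id = Eye(n)                 # stores only the dimension

v = ones(n)
@assert lazy_id * v == sparse_id * v   # identical behavior under multiplication
```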

@ararslan
Contributor

Every time I've interrupted the computation, it seems to be in the middle of this: https://github.com/JuliaOpt/Convex.jl/blob/master/src/conic_form.jl#L62-L71. Note in particular the comment that says it's time consuming.

@RoyiAvital

I also encounter many "Out of Memory" cases, even though I have 32 GB of memory, when I try to solve a quad form with a 40000 × 40000 matrix of ~15% density.

Is this something I should expect Convex to handle or not?
I tried both SCS and Gurobi, and I think it crashes before calling the solver.

@cossio
Author

cossio commented Aug 28, 2021

It's been a long time since I tried this. But from what I recall, I think the issue was in the problem setup part, so not related to the solver.

@RoyiAvital

@cossio, I agree. I think something in the problem formulation makes it unhappy with medium-sized problems.

@ericphanson
Collaborator

I experimented with a more efficient representation in #393 and had some success, at least in reducing formulation time (I don't think I measured memory consumption), but I don't do much optimization myself anymore (and never really needed to solve big problems, just weird ones like mixed-integer SDPs or high-precision SDPs), so I don't know if I'll ever really have time to work more on that.

Happy to provide code review and guidance if anyone wants to pick that up or try other approaches.

@ericphanson
Collaborator

I took a look at this on the #504 branch, but it didn't seem any better. On that branch,

using Convex
m = 33378 # number of rows in Adj
Adj = rand(m, 2730)
x = Variable(2730)
H = Adj * x
problem = minimize(sum(H))

# Load the problem into an MOI model
@time context = Convex.Context(problem, MOI.Utilities.Model{Float64}())

shows the problem already. Reducing m to a few thousand makes it much faster. It spends most of its time forming the sparse matrices or computing the kron products.

@ericphanson
Collaborator

ericphanson commented Jul 16, 2023

I re-added a problem-specific conic-form cache (which I had dropped in my PR), switched to SuiteSparseGraphBLAS for sparse arrays, and did some optimization, and managed to formulate this in

1015.189386 seconds (37.91 M allocations: 215.044 GiB, 93.00% gc time)

on 4311f19 with the script

using Convex, Clarabel, JLD2
import MathOptInterface as MOI
file = JLD2.load(joinpath("/Users/eph/Downloads", "cvx_example.jld2"))

let
    # m = 2000
    # Adj = file["Adj"][1:m, :]
    # N = file["N"][1:m, :, :]

    Adj = file["Adj"]
    N = file["N"]
    x = Variable(2730)
    S, V, T = size(N)
    lN = log.(N)
    M = vec(sum(N[:, :, 2:T]; dims=(2, 3)))
    H = Adj * x
    problem = maximize(-sum(logsumexp(lN[:, v, t] - H) for t = 1:T-1, v = 1:V) - dot(M, H), [x ≥ -1e2, x ≤ 1e2])

    @time context = Convex.Context(problem, MOI.Utilities.Model{Float64}())
end

This is a lot better than before, but it still seems slow and is obviously GC-heavy. It finished on my 16 GB M1 MacBook Pro, but the memory pressure was high for most of it.

Note also that in an earlier run I ran into GC issues: once I had tried a big run like that, even small problems took a long time, with 99% of time in GC. The let block here is to try to combat that.

edit: In 8b00125 I switched back to SparseArrays to avoid some GC corruption issues and other bugs, and it regressed to

1303.736478 seconds (11.94 M allocations: 191.588 GiB, 92.49% gc time)

which is still at least "formulatable".
