run time of app not scaling linearly with input file length #75

christakahashi · 2019-12-13T22:01:40Z

I have my input file and query file. the query file has 4 sequences (~70base) and the input file is a 60mb fasta with reads of length ~100. splitting the input file into 20 3 mb pieces and running the parasail app runs in 11seconds/piece or about three and a half minutes total. Running the full input file takes hours. I'd expect it to take the same amount of time since i'm getting the same number of comparisons at the end.

I'm running on a newish intel server with 256gb ram.

I'm invoking the parasail app from python:

     p= subprocess.Popen((parasail_path,'-d',
                         '-f',fname,
                         '-q',qname, 
                         '-b',"10",
                         '-g', base_path+"readout",
                         '-O', "SSW", 
                         '-a', "sg_trace_striped_16", 
                         "-t", "1",
                         "-o","8",
                         "-c",cutoff,
                         "-e",gapextend ),
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)
    p.wait()

The text was updated successfully, but these errors were encountered:

jeffdaily · 2019-12-16T18:03:38Z

Would you be able to share your input files so I can try and reproduce exactly?

christakahashi · 2019-12-17T00:14:36Z

I put synthetic data here: ftp://ftp.cs.washington.edu/outgoing/fasta_files.zip that seems to behave in the same way. fake.nn.fasta are the file fake_full.fasta split into 20 parts.

heres the query file ftp://ftp.cs.washington.edu/outgoing/query.fasta

i'm using the invokation above with cutoff = 8 gapextend = 0

(note the above links will probably only work for the next 90 days)

jeffdaily · 2019-12-17T21:46:32Z

Thank you very much for the sample inputs. It was very easy to reproduce, but I'm still digging for an answer. When I enabled the verbose output -v I noticed that the ESA time was exploding. That's the enhanced suffix array that is used for filtering out pairs based on an exact-match cutoff. Your cutoff of 8 seems to not exclude any sequences, at least for the sample data you provided. In that case, you can skip the whole ESA filtering step if you use -x.

In the meantime, I'm still looking into why the ESA step was so terrible.

jeffdaily · 2019-12-17T22:49:28Z

I don't have a great answer for you, but it seems the ESA step is just plain inefficient with your particular inputs and selected cutoff. I seem to have more success with it when aligning sequences from a larger alphabet, e.g., proteins.

I will suggest again that in your particular case you don't need this filter at all, so just disable it with -x.

christakahashi · 2019-12-17T22:59:39Z

yeah that's an artifact of the lazy way I made they synthetic data set. my real set doesn't always match. I have permission to share (privately) the real data set if you think it helps.

For now I can keep splitting the file into smaller parts or run -x and filter later in my pipeline fairly cheaply.

Thanks for taking a look!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

run time of app not scaling linearly with input file length #75

run time of app not scaling linearly with input file length #75

christakahashi commented Dec 13, 2019 •

edited

Loading

jeffdaily commented Dec 16, 2019

christakahashi commented Dec 17, 2019 •

edited

Loading

jeffdaily commented Dec 17, 2019

jeffdaily commented Dec 17, 2019

christakahashi commented Dec 17, 2019

run time of app not scaling linearly with input file length #75

run time of app not scaling linearly with input file length #75

Comments

christakahashi commented Dec 13, 2019 • edited Loading

jeffdaily commented Dec 16, 2019

christakahashi commented Dec 17, 2019 • edited Loading

jeffdaily commented Dec 17, 2019

jeffdaily commented Dec 17, 2019

christakahashi commented Dec 17, 2019

christakahashi commented Dec 13, 2019 •

edited

Loading

christakahashi commented Dec 17, 2019 •

edited

Loading