Fragment-level analysis #68
As an alternative to all this, how about providing a way to easily plot pair frequency by distance and read direction? |
Here's an example using the dask adapter for pairix in bioframe. It certainly doesn't require either dask or pairix, but the counts would need to be accumulated chunkwise. The functions to compute "contact areas" and pairs scalings were recently moved to cooltools; they are quite short and reusable: https://github.com/mirnylab/cooltools/blob/master/cooltools/expected.py#L53 |
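For intuition, the "contact area" of a distance bin is just the area of the region of position pairs (x, y) on one chromosome whose separation falls in that bin. A minimal continuous-approximation sketch (this is not the cooltools implementation, which generalizes to pairs of genomic regions):

```python
def contact_area(d0, d1, length):
    """Area of {(x, y): 0 <= x < y < length, d0 <= y - x < d1}.

    This is the normalization denominator for a P(s) distance bin [d0, d1)
    on a chromosome of the given length, in a continuous approximation.
    """
    d1 = min(d1, length)  # a bin cannot extend past the chromosome end
    if d1 <= d0:
        return 0.0
    # Integrate (length - d) over d in [d0, d1)
    return 0.5 * ((length - d0) ** 2 - (length - d1) ** 2)
```

Summing this over chromosomes for each distance bin gives the per-bin normalization for the pair counts.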
I think we just need to tweak the output of stats a bit so that we don't need to do anything on the raw pairs: it already saves the number of contacts at different distances with different orientations. I can make a plot like this from stats: I suppose it lacks normalisation over the areas for each bin... Also, if there are any ideas what went wrong with the "Bad" sample, I would be happy to hear your thoughts... |
OK, using @nvictus's suggestion I can quite easily make this using the areas functions from cooltools and the counts from stats. Still not sure how to interpret the difference between samples, though (the bad one was not sequenced deeply, so the curves are not smooth). Code, just in case it's useful:

```python
import numpy as np
import pandas as pd
from cooltools import expected

def get_areas(d, chroms=pd.read_table('/home/s1529682/eddie/ilia/genomes/mm9/chrfile.txt',
                                      names=['chrom', 'length'])):
    # Total "contact area" of each distance bin, summed over all chromosomes.
    bins = np.append(np.unique(d['Start'].values), [np.inf])
    areas = np.zeros(len(bins) - 1)
    for i in range(len(chroms)):
        areas = areas + expected.contact_areas(bins, (0, chroms['length'][i]),
                                               (0, chroms['length'][i]))
    return areas

def get_scaling_by_orientation(stats):
    # Parse the dist_freq/<start>-<end>/<orientation> entries of the stats output.
    d = stats.loc[stats.index.str.startswith('dist_freq')]
    d = d.reset_index()
    d[['Distance', 'Orientation']] = d['index'].str.split('/', expand=True).iloc[:, 1:]
    d = d.drop('index', axis=1)
    d[['Start', 'End']] = d['Distance'].str.split('-', expand=True).apply(
        lambda x: [int(i.replace('+', '')) if i is not None else np.inf for i in x])
    # Geometric mean of the bin edges as the representative distance.
    d['Distance'] = np.exp((np.log(d['Start']) + np.log(d['End'])) / 2)
    d = d[['Orientation', 'Distance', 'Start', 'End', 'Value']]
    return d

def plot_scaling_by_orientation(stats, ax):
    d = get_scaling_by_orientation(stats)
    areas = get_areas(d)
    for orientation in '+-', '-+', '++', '--':
        sd = d[d['Orientation'] == orientation].copy()  # avoid SettingWithCopyWarning
        sd['Value'] /= areas
        ax.plot(sd['Distance'], sd['Value'] / d['Value'].sum(), label=orientation)
```
|
Are they the same cell type etc.? If so, there may be many ligations in trans in the "bad" dataset, judging by how its P(s) flattens out around 10^6. If this is the case and you plot the average trans level normalized appropriately, you can see the cis curve reaching this "noise floor", as we show in MicroC-XL supFig3 (
https://media.nature.com/original/nature-assets/nmeth/journal/v13/n12/extref/nmeth.4025-S1.pdf
)
|
Thanks @gfudenberg! |
> Thanks @gfudenberg! Yes, same cell type.
> Yeah, there is a lot of trans (60-70%), and yeah, the scaling is flat on the right, clearly a lot of random ligations or something like that. What I mean is that there are also differences on the left side, and I thought maybe they could hint at what exactly is going wrong? Because I just have no clue what is going wrong now...

Hmm, I don't know if anyone has systematically looked into different failure modes & how they relate to short-range orientation P(s), in part because I'd guess that a bunch of experiments that didn't work well never make it to GEO...
I've never seen a bump at 30 kb before, so if that's real (and not just, e.g., noise from not sequencing deeply enough), my first guess would be that something went wrong with the restriction step, leaving a bunch of large fragments which then have trouble ligating. Anything funky with the size of the 3C products on the gel?
|
> I realize that, I just thought that since you guys have been analyzing other people's data, maybe you encountered something like that.
>
> > Do you mean 3 kb?
>
> Ah, yes, I meant 3 kb. Digestion works fine though... The size shift after Hi-C with blunt-end ligation is hard to see on the gel sometimes, so last week I just did 3C the same way as I do Hi-C to check that the reagents are working fine, and the gel looks great.

I see... yes, it would be interesting to systematically look at the cis/total ratio vs. various short-distance orientation statistics (e.g. the ratio of self-circles to dangling ends, etc.) to see if there are some useful troubleshooting signals.
|
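For reference, the self-circle / dangling-end distinction mentioned above comes from strand orientation at short separations. A hypothetical sketch (the category names and the distance cutoff are illustrative, not pairsamtools' actual output):

```python
def classify_short_range(strand1, strand2, distance, cutoff=1000):
    """Label a same-chromosome pair by its strand orientation at short range."""
    if distance >= cutoff:
        return "contact"
    if (strand1, strand2) == ("+", "-"):
        return "dangling_end_like"  # inward-facing: likely an unligated fragment
    if (strand1, strand2) == ("-", "+"):
        return "self_circle_like"   # outward-facing: fragment ligated to itself
    return "same_strand"            # ++ or --: consistent with a true ligation
```

Tracking the ratio of these categories per library is one way to build the troubleshooting signal discussed here.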
There's an enrichment in the supposed "self-circle" fraction below 10 kb, with a strange "double hump". I'm not sure how unusual that is, but I would have guessed some issue with fill-in or ligation. Your gel suggests that ligation is not the problem, though; maybe it's something downstream, like library prep or PCR. Have you looked at @gibcus's bootcamp slides? There might be some QC ideas to try there, like digesting the PCR products. He might be able to chime in too. |
@gfudenberg you have a good eye! The small peak at 3 kb is real and present in a few other new bad datasets, but never in the old good ones. Here is an extremely bad dataset where it's really obvious: And by looking at a few of them, the height of the peak clearly correlates with how flat the scaling is on the right end! WTF is this?! |
Wow, this looks so weird!!
Btw, what are the mapping stats, i.e. the fractions of duplicates, unmapped reads and multimappers? Do they differ much from the "good" libraries?
|
Huh! That's very concerning, actually! This suggests that the issue might be downstream of ligation, in sequencing.
Did you check fastqc of the "bad" samples? Specifically, I'd check whether you still have adapters.

> Yeah, there are differences; values as a fraction of total reads:
> https://user-images.githubusercontent.com/2895034/40312765-b8848716-5d0b-11e8-84f3-a5c40b26709c.png
> https://user-images.githubusercontent.com/2895034/40312930-2365c96e-5d0c-11e8-9aea-8d093d18b6ca.png
|
No, I haven't checked, actually... Why do you think it might be adaptors? distiller can do that for me, right, if I specify `do_fastqc: True`? |
It's just a guess, thinking of what could bump up the multimappers and nulls...
Yes, `do_fastqc` runs fastqc :)
|
No adaptors, but some weird stuff is going on!!! Unequal level of A and T across the whole read! And other things... |
Also, an intense GC bias...

> If you want to have a look, here is a zip with the output: serum-RI3t is good, RI4t is bad (the first plot here on top is from these). fastqc.zip: https://github.com/mirnylab/pairsamtools/files/2023009/fastqc.zip
|
(I have to say, they were sequenced on different machines) |
Also, the GATC signature at the beginning of the reads is way more pronounced. |
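The per-base composition curves being discussed are easy to recompute from raw reads if you want to go beyond fastqc's plots; a minimal sketch:

```python
from collections import Counter

def per_position_composition(reads):
    """Fraction of A/C/G/T at each position across a set of reads.

    This is the statistic behind fastqc's "per base sequence content" plot,
    where an A-vs-T imbalance or a GATC signature at the read start shows up.
    """
    length = max(len(r) for r in reads)
    counts = [Counter() for _ in range(length)]
    for r in reads:
        for i, base in enumerate(r):
            counts[i][base] += 1
    fractions = []
    for c in counts:
        total = sum(c[b] for b in "ACGT") or 1  # guard against empty positions
        fractions.append({b: c[b] / total for b in "ACGT"})
    return fractions
```

Comparing these curves between the forward and reverse reads of good and bad libraries would quantify the biases described above.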
So no ideas from this?.. |
Any chance you can bug the sequencing facility? This _strongly_ seems like an issue of sequencing, not Hi-C.
|
If you mean there is something wrong on the technical side of the sequencing: some bad samples were sequenced in one place, and some in another... But I might ask them for their opinion on what could cause such issues. |
Hm, I'm confused now: (a) are there "bad" samples that don't have fastqc issues? (b) are there "good" samples that were sequenced in the same facility as the "bad" ones?
|
a) I haven't run fastqc on all of them. I can try to compare them more comprehensively tomorrow. |
My point is that all the observations you've shown so far are consistent with a single issue behind the "bad" samples: faulty sequencing. Is there any piece of evidence that contradicts this theory?
|
Both good and bad data have been produced in two different facilities, and at one point a few months ago all libraries, wherever we sequenced them, started having similar issues. |
And what about the A-vs-T bias, is it common to all "bad" datasets?

> So, I ran fastqc for many of the good and bad libraries, and the only mostly consistent problem with the bad ones is a very high over-representation of a particular group of k-mers in a particular place in the forward reads. This sequence looks like part of the adaptor sequence; why it is so often found in a particular place, and only in the forward read, is completely enigmatic to me. But their actual counts are pretty low and overall adaptor sequences are not very frequent, so I don't think this can be the source of these problems.
> https://user-images.githubusercontent.com/2895034/40361958-fe7962ca-5dc2-11e8-8e1c-2820c0213a36.png
|
And yeah, a similar effect in G vs. C is also present in the reverse reads. |
I believe all of this is sufficiently concerning to contact the sequencing facility; you paid a lot of money for it, after all.
Is the amount of the T->A and C->G bias correlated with how "bad" the library is?
To conclude: do you think the issues of the "bad" libraries can be entirely related to sequencing (and not Hi-C), or are there some observations against that theory?

> Well, it seems it's quite commonly present in reverse reads, but not in forward reads, whatever that means, and however that is possible... Here is an example:
> https://user-images.githubusercontent.com/2895034/40383537-094abc18-5df9-11e8-9e94-94eb53eade8b.png
> https://user-images.githubusercontent.com/2895034/40383552-11b11e2e-5df9-11e8-9ce7-4c68943f927a.png
|
I will check that. Well, as I said, I think it's unlikely that two facilities would stop producing good data at the same time, with similar symptoms... |
It seems like quality (the scaling shallowness) is correlated with this bias, yes. But it's hard to say definitively. |
Ah, great point. Btw, your multimappers went through the roof: more than half of the sample. Maybe you can check them manually to see what's wrong with them?
|
And actually I was wrong when I said that it's not present in forward reads: reverse reads are certainly more affected, but in at least one run most samples have it in the forward reads too, to a different extent, which seems to correlate with just how bad they are. |
Yeah, I noticed that too, and it's present (again, to a different extent) in a few samples I checked, but I'm not sure how to look at them - like, what exactly should I look into? |
Also, how about we switch to e.g. G-chat @golobor ? I doubt everyone is interested in this! |
Hi, |
Discussion to address open2c/distiller-nf#96.
Since there is `pairsamtools restrict`, we can use that as a starting point for adding fragment-level stats. But after some thinking, it seems more complicated to implement properly than I thought.
So, how about we add another column (`pair_frags_type`?) to the pairs file with the classification of the pairs according to this. And then maybe the actual counting can be a part of the regular stats? Then in `distiller`, `restrict` should be called before `stats`, if we include it there (and I think it should be there, since people expect this kind of stats)?
Here is a list of the stats that need to be implemented, basing off of `hiclib`. What did I forget?
- … (not in `hiclib`, but seems logical to include it)
- … (`hiclib` also has a category "error" here, what does that mean? @mimakaev)
- … (as in `hiclib`?)
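To make the proposal above concrete, here is a hypothetical sketch of fragment-level pair classification: assign each side of a pair to a restriction fragment and label the pair by the fragment relationship. The function and category names are illustrative, not an actual pairsamtools API:

```python
import bisect

def assign_fragment(pos, frag_starts):
    """Index of the restriction fragment containing pos.

    frag_starts: sorted start coordinates of fragments on one chromosome.
    """
    return bisect.bisect_right(frag_starts, pos) - 1

def classify_pair_frags(pos1, strand1, pos2, strand2, frag_starts):
    f1 = assign_fragment(pos1, frag_starts)
    f2 = assign_fragment(pos2, frag_starts)
    if f1 == f2:
        # Both sides on one fragment: outward-facing reads suggest a
        # self-circle, otherwise a dangling end / unligated fragment.
        return "self_circle" if (strand1, strand2) == ("-", "+") else "dangling_end"
    if abs(f1 - f2) == 1:
        return "adjacent_frags"  # possible re-ligation of neighboring fragments
    return "distal_frags"        # a regular ligation product
```

Counting these labels per library would then slot naturally into the regular stats output.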