Filtering Data¶
Data can be filtered arbitrarily using pd.DataFrame methods, but the
hicutils.core.filters module provides helper utilities for common filtering
routines. Examples include removing non-productive clones and excluding
clones by copy-number cutoffs.
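As a point of comparison, the sketch below filters a small hand-made DataFrame with ordinary pandas boolean masks. The toy data and the `functional` column name are assumptions made for illustration; they are not taken from hicutils or its example data.

```python
import pandas as pd

# Toy data: column names (especially `functional`) are assumptions.
df = pd.DataFrame({
    'subject': ['S1', 'S1', 'S1', 'S2'],
    'clone_id': [1, 2, 3, 4],
    'functional': [True, False, True, True],
    'copies': [10, 7, 2, 5],
})

# Boolean masks combine with & (and), | (or) and ~ (not).
mask = df['functional'] & (df['copies'] >= 5)
filtered = df[mask]

print(filtered['clone_id'].tolist())
```

The helpers in this module wrap this kind of masking so that common cutoffs do not have to be rewritten for every analysis.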
Examples¶
Filtering by functionality and copies¶
This example shows how to remove non-functional clones and any clone with fewer than 5 copies across all samples in the subject.
import hicutils as hu

df = hu.io.read_directory('example_data_immunedb')
filtered_df = (
    df
    .pipe(hu.filters.filter_functional)
    .pipe(hu.filters.filter_by_overall_copies, 5)
)
display('Total Functional, 5+ Copy Clones:',
        filtered_df.groupby('subject').clone_id.nunique())
'Total Functional, 5+ Copy Clones:'
subject
HPAP015    1951
HPAP017    2391
Name: clone_id, dtype: int64
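To make the idea of an overall copy cutoff concrete, here is a minimal pandas sketch under one possible reading: a clone's copies are summed over all of its rows and the cutoff is applied to that total. This is an illustration on made-up data, not the hicutils implementation, which may group or count differently.

```python
import pandas as pd

# Toy data: one row per clone per replicate (an assumed layout).
df = pd.DataFrame({
    'clone_id': [1, 1, 2, 3, 3],
    'replicate_name': ['rep1', 'rep2', 'rep1', 'rep1', 'rep2'],
    'copies': [4, 1, 3, 2, 2],
})

# Sum each clone's copies over all of its rows, aligned back to the rows.
total_copies = df.groupby('clone_id')['copies'].transform('sum')

# Keep only rows belonging to clones with at least 5 copies in total.
kept = df[total_copies >= 5]

print(kept)
```

Under this reading, individual rows of a retained clone can still show fewer copies than the cutoff, because the cutoff applies to the clone's total rather than to each row.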
Filtering clones based on presence in replicates¶
The following examples show different ways of filtering clones based on copies and whether or not they're found in certain replicates.
In this first example, clones found in fewer than two replicates are excluded.
df = hu.io.read_directory('example_data_immunedb')
pdf = hu.filters.filter_number_of_pools(df, 'replicate_name', 2)
display(f'There are {pdf.clone_id.nunique()} clones in any two or more replicates')
'There are 487 clones in any two or more replicates'
It is also possible to limit the pools (replicates in this case) to check for overlap. For example, this code snippet looks for clones that are in both of the HPAP015 replicates.
limit_reps = [
    'IgH_HPAP015_rep1_200p0ng', 'IgH_HPAP015_rep2_200p0ng'
]
pdf = hu.filters.filter_number_of_pools(df, 'replicate_name', 2, limit_to=limit_reps)
display(f'There are {pdf.clone_id.nunique()} clones in both '
        f'replicates {", ".join(limit_reps)}')
'There are 380 clones in both replicates IgH_HPAP015_rep1_200p0ng, IgH_HPAP015_rep2_200p0ng'
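The replicate-based filters above can be approximated in plain pandas as follows. The helper `pools_sketch` is hypothetical and written only for illustration: it counts the distinct pools each clone appears in (optionally restricted to a subset) and compares that count to `n` with a numpy comparison function. How hicutils resolves `func` and `limit_to` internally is not shown in this document, so treat this as a sketch under those assumptions.

```python
import numpy as np
import pandas as pd


def pools_sketch(df, pool, n, func=np.greater_equal, limit_to=None):
    """Hypothetical helper, not the hicutils implementation."""
    # Optionally restrict the rows used for counting to a subset of pools.
    rows = df if limit_to is None else df[df[pool].isin(limit_to)]
    # Number of distinct pools in which each clone occurs.
    n_pools = rows.groupby('clone_id')[pool].nunique()
    # Clones whose pool count satisfies the comparison with n.
    keep = n_pools[func(n_pools, n)].index
    return df[df['clone_id'].isin(keep)]


# Toy data: one row per clone per replicate.
df = pd.DataFrame({
    'clone_id': [1, 1, 2, 3, 3, 4, 4, 4],
    'replicate_name': ['rep1', 'rep2', 'rep1', 'rep2', 'rep3',
                       'rep1', 'rep2', 'rep3'],
})

any_two = pools_sketch(df, 'replicate_name', 2)
both = pools_sketch(df, 'replicate_name', 2, limit_to=['rep1', 'rep2'])
only_one = pools_sketch(df, 'replicate_name', 1, func=np.equal)
```

Passing a different comparison such as `np.equal` changes the rule from "at least n pools" to "exactly n pools".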
The example below removes any clone found in IgH_HPAP015_rep1_200p0ng.
pdf = hu.filters.filter_number_of_pools(
    df[df.subject == 'HPAP015'],
    'replicate_name',
    0,
    limit_to=['IgH_HPAP015_rep1_200p0ng']
)
display(f'There are {pdf.clone_id.nunique()} HPAP015 clones '
        'NOT in IgH_HPAP015_rep1_200p0ng')
'There are 19620 HPAP015 clones NOT in IgH_HPAP015_rep1_200p0ng'
Filtering on gene frequency¶
It is possible to filter clones based on the overall gene frequency. The example below filters out clones with a V-gene that is utilized at less than 0.5% in the associated donor.
pdf = hu.filters.filter_by_gene_frequency(df, 0.005)
display(f'There are {len(df)} total clones and {len(pdf)} after filtering')
'There are 40000 total clones and 39291 after filtering'
The example below filters out clones with a V-gene that is utilized at less than 1% in the associated disease group.
pdf = hu.filters.filter_by_gene_frequency(df, 0.01, by='METADATA_disease')
display(f'There are {len(df)} total clones and {len(pdf)} after filtering')
'There are 40000 total clones and 39183 after filtering'
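A gene-frequency cutoff of this kind can be sketched in plain pandas. The version below assumes that frequency means the fraction of rows within each group (here `subject`) that use a given V-gene; the library may define frequency differently, for example by weighting on copies, so this is an illustration rather than a description of its internals. The data is made up.

```python
import pandas as pd

# Toy data: one row per clone.
df = pd.DataFrame({
    'subject': ['S1'] * 5 + ['S2'] * 4,
    'clone_id': list(range(1, 10)),
    'v_gene': ['IGHV1-2', 'IGHV1-2', 'IGHV1-2', 'IGHV1-2', 'IGHV3-23',
               'IGHV3-23', 'IGHV3-23', 'IGHV1-2', 'IGHV1-2'],
})

# Rows per (subject, gene) divided by rows per subject.
gene_rows = df.groupby(['subject', 'v_gene'])['clone_id'].transform('count')
group_rows = df.groupby('subject')['clone_id'].transform('count')
frequency = gene_rows / group_rows

# Remove clones whose gene is used at or below the cutoff in their subject.
min_frequency = 0.25
kept = df[frequency > min_frequency]

print(kept['clone_id'].tolist())
```

Because the frequency is computed per group, the same gene can be removed in one subject (IGHV3-23 in S1, at 1 of 5 rows) and kept in another (S2, at 2 of 4 rows).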
API Documentation¶
- hicutils.core.filters.filter_by_gene_frequency(df, min_frequency, by='subject', gene='v_gene')¶
Removes clones in by (defaults to subject) which have an overall gene usage less than or equal to min_frequency.
For example, if min_frequency=0.05 and by='subject', all clones using a V-gene with a frequency less than or equal to 0.05 in a given subject are removed.
Parameters¶
- df : pd.DataFrame
The DataFrame to filter.
- min_frequency : float
The frequency cutoff. Clones whose gene has a frequency in by at or below this value are removed.
- by : str
The column on which to calculate frequency. Defaults to subject.
- gene : str
The gene on which to filter. Accepts v_gene or j_gene, defaulting to v_gene.
Returns¶
DataFrame filtered on gene frequency in by.
- hicutils.core.filters.filter_by_overall_copies(df, copies, field='clone_id')¶
Removes clones, identified by field (default clone_id), that have fewer than copies total copies across all pools.
Changing field changes the definition of a clone. For example, setting field to 'cdr3_aa' will define clones by their CDR3 AA sequence.
Parameters¶
- df : pd.DataFrame
The DataFrame to filter.
- copies : int
The minimum copy number of each clone required to be included in the resulting DataFrame.
- field : str
The column that defines a clone. Defaults to clone_id.
Returns¶
DataFrame filtered by copies.
Examples¶
The following removes all clones with fewer than 5 copies from df:
>>> df.copies.min()
1
>>> df = filter_by_overall_copies(df, 5)
>>> df.copies.min()
4
- hicutils.core.filters.filter_by_presence(df, pool, pool_value)¶
Filters clones based on presence in a given pool.
Parameters¶
- df : pd.DataFrame
The DataFrame to filter.
- pool : str
The pool on which to filter.
- pool_value : str
The pool value on which to filter.
Returns¶
DataFrame filtered by presence in the given pool.
- hicutils.core.filters.filter_functional(df, functional=True)¶
Filters clones by functionality, removing non-functional clones by default. Setting functional to False removes functional clones instead.
Parameters¶
- df : pd.DataFrame
The DataFrame to filter.
- functional : bool
The functionality of the clones to include. Set to True (the default) to include functional clones only. Set to False to include only non-functional clones.
Returns¶
DataFrame filtered by functionality.
- hicutils.core.filters.filter_number_of_pools(df, pool, n, func='greater_equal', limit_to=None)¶
Filters clones based on the number of pools in which they occur.
Parameters¶
- df : pd.DataFrame
The DataFrame to filter.
- pool : str
The pool on which to filter.
- n : int
The number of distinct pools a clone must be in to be included in the resulting DataFrame.
- func : function
The comparison function to use between n and the number of occurrences of each clone. The default is greater_equal, meaning a clone must occur in ≥ n pools to be included. Any numpy comparison function may be used, such as equal or less_equal.
- limit_to : list(str), str, None
If specified, overlap will be limited to the specified pools. This is useful for filtering clones based on their overlap in a subset of pools.
Returns¶
DataFrame filtered by number of pools.
- hicutils.core.filters.remove_potential_contaminates(df, pool, pool_values, clone_feature='cdr3_nt')¶
Removes clones based on clone_feature (defaults to CDR3 NT) which occur in pool with values pool_values. For example, to remove all clones with CDR3 NT sequences found in subjects ‘Fibroblast’ and ‘Water’:
remove_potential_contaminates(df, 'subject', ['Fibroblast', 'Water'])
Parameters¶
- df : pd.DataFrame
The DataFrame to filter.
- pool : str
The pool to use for filtering.
- pool_values : list
The values of pool which should be the basis of clonal exclusion.
- clone_feature : str
The clone feature to use for filtering. For example, cdr3_nt (the default) will use the CDR3 NT sequence as the basis for removing other clones.
Returns¶
DataFrame with clones occurring in pool with values pool_values excluded on the basis of clone_feature.
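The contaminant-removal logic described above can be sketched in plain pandas: collect the feature values seen in the control pools, then drop every row that carries one of them. This is an illustration on made-up data and sequences, not the library's implementation. In particular, whether hicutils also drops the control rows themselves is an assumption here.

```python
import pandas as pd

# Toy data: 'Water' and 'Fibroblast' stand in for negative controls.
df = pd.DataFrame({
    'subject': ['Water', 'Fibroblast', 'S1', 'S1', 'S2'],
    'cdr3_nt': ['TGTGCA', 'TGTACC', 'TGTGCA', 'TGTGGG', 'TGTACC'],
})

# Feature values observed in the control pools.
in_controls = df['subject'].isin(['Fibroblast', 'Water'])
suspect_values = df.loc[in_controls, 'cdr3_nt'].unique()

# Drop every row sharing a feature value with a control, controls included.
cleaned = df[~df['cdr3_nt'].isin(suspect_values)]

print(cleaned)
```

Matching on a sequence feature rather than on clone_id means a sequence seen in a control is removed from every subject in which it appears.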