Filtering Data¶

Data can be filtered arbitrarily using pd.DataFrame methods, but the hicutils.core.filters module provides helper utilities for common filtering routines. Examples include removing non-productive clones and excluding clones by copy-number cutoffs.
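
For comparison, the same kind of filtering can be written directly with pandas. The sketch below uses a small hand-made DataFrame; the column names functional and copies are assumptions chosen for illustration and may not match the columns that hicutils produces.

```python
import pandas as pd

# Toy data; the 'functional' and 'copies' columns are assumed for illustration.
toy_df = pd.DataFrame({
    'clone_id': [1, 2, 3, 4],
    'subject': ['A', 'A', 'B', 'B'],
    'functional': [True, False, True, True],
    'copies': [10, 7, 2, 6],
})

# Keep functional rows with at least 5 copies using boolean indexing.
plain_filtered = toy_df[toy_df['functional'] & (toy_df['copies'] >= 5)]
print(plain_filtered['clone_id'].tolist())
```

The helpers described on this page wrap patterns like this so that they can be chained with .pipe().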

Examples¶

Filtering by functionality and copies¶

This example shows how to remove non-functional clones and any clone with fewer than 5 copies across all samples in the subject.

In [1]:

import hicutils as hu

df = hu.io.read_directory('example_data_immunedb')
filtered_df = (
    df
    .pipe(hu.filters.filter_functional)
    .pipe(hu.filters.filter_by_overall_copies, 5)
)
display('Total Functional, 5+ Copy Clones:',
        filtered_df.groupby('subject').clone_id.nunique())

'Total Functional, 5+ Copy Clones:'

subject
HPAP015    1951
HPAP017    2391
Name: clone_id, dtype: int64

Filtering clones based on presence in replicates¶

The following examples show different ways of filtering clones based on copies and whether or not they're found in certain replicates.

In this first example, clones found in fewer than two replicates are excluded.

In [2]:

df = hu.io.read_directory('example_data_immunedb')
pdf = hu.filters.filter_number_of_pools(df, 'replicate_name', 2)
display(f'There are {pdf.clone_id.nunique()} clones in any two or more replicates')

'There are 487 clones in any two or more replicates'

It is also possible to limit the pools (replicates in this case) to check for overlap. For example, this code snippet looks for clones that are in both of the HPAP015 replicates.

In [3]:

limit_reps = [
    'IgH_HPAP015_rep1_200p0ng', 'IgH_HPAP015_rep2_200p0ng'
]
pdf = hu.filters.filter_number_of_pools(df, 'replicate_name', 2, limit_to=limit_reps)
display(f'There are {pdf.clone_id.nunique()} clones in both '
        f'replicates {", ".join(limit_reps)}')

'There are 380 clones in both replicates IgH_HPAP015_rep1_200p0ng, IgH_HPAP015_rep2_200p0ng'

The example below removes any clone found in IgH_HPAP015_rep1_200p0ng.

In [4]:

pdf = hu.filters.filter_number_of_pools(
    df[df.subject == 'HPAP015'],
    'replicate_name',
    0,
    limit_to=['IgH_HPAP015_rep1_200p0ng']
)
display(f'There are {pdf.clone_id.nunique()} HPAP015 clones '
        'NOT in IgH_HPAP015_rep1_200p0ng')

'There are 19620 HPAP015 clones NOT in IgH_HPAP015_rep1_200p0ng'

Filtering on gene frequency¶

It is possible to filter clones based on the overall gene frequency. The example below filters out clones with a V-gene that is utilized at less than 0.5% in the associated donor.

In [10]:

pdf = hu.filters.filter_by_gene_frequency(df, 0.005)
display(f'There are {len(df)} total clones and {len(pdf)} after filtering')

'There are 40000 total clones and 39291 after filtering'

The example below filters out clones with a V-gene that is utilized at less than 1% in the associated disease group.

In [15]:

pdf = hu.filters.filter_by_gene_frequency(df, 0.01, by='METADATA_disease')
display(f'There are {len(df)} total clones and {len(pdf)} after filtering')

'There are 40000 total clones and 39183 after filtering'

API Documentation¶

hicutils.core.filters.filter_by_gene_frequency(df, min_frequency, by='subject', gene='v_gene')¶

Removes clones in by (defaults to subject) which have an overall gene usage less than or equal to min_frequency.

For example, if min_frequency=0.05 and by='subject', all clones using a V-gene with a frequency less than or equal to 0.05 in a given subject are removed.

Parameters¶

df (pd.DataFrame): The DataFrame to filter.
min_frequency (float): The minimum frequency of a gene in by that should be included.
by (str): The column on which to calculate frequency. Defaults to subject.
gene (str): The gene on which to filter. Accepts v_gene or j_gene; defaults to v_gene.

Returns¶

DataFrame filtered on gene frequency in by.
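
The idea behind this filter can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's implementation: the helper name keep_by_gene_frequency is hypothetical, and frequency is assumed here to be the fraction of rows in each by group that use a given gene.

```python
import pandas as pd

def keep_by_gene_frequency(df, min_frequency, by='subject', gene='v_gene'):
    """Keep rows whose gene frequency within its `by` group exceeds min_frequency."""
    ones = pd.Series(1, index=df.index)
    # Rows per (group, gene) pair and rows per group, aligned to df's index.
    gene_counts = ones.groupby([df[by], df[gene]]).transform('sum')
    group_counts = ones.groupby(df[by]).transform('sum')
    frequency = gene_counts / group_counts
    # Strict comparison: genes at or below the cutoff are removed.
    return df[frequency > min_frequency]

# Hypothetical toy data: V2 is rare in subject A but common in subject B.
genes = pd.DataFrame({
    'subject': ['A', 'A', 'A', 'A', 'B', 'B'],
    'v_gene': ['V1', 'V1', 'V1', 'V2', 'V2', 'V2'],
})

kept_genes = keep_by_gene_frequency(genes, 0.3)
print(kept_genes.index.tolist())
```

Because the frequency is computed per group, the V2 row is dropped for subject A while the V2 rows for subject B are kept.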

hicutils.core.filters.filter_by_overall_copies(df, copies, field='clone_id')¶

Removes clones identified by field (default clone_id) from a DataFrame with fewer than copies total copies across all pools.

Changing field changes the definition of a clone. For example, setting field to 'cdr3_aa' will define clones by their CDR3 AA sequence.

Parameters¶

df (pd.DataFrame): The DataFrame to filter.
copies (int): The minimum copy number of each clone required to be included in the resulting DataFrame.
field (str): The column that identifies a clone. Defaults to clone_id.

Returns¶

DataFrame filtered by copies.

Examples¶

The following removes all clones with fewer than 5 total copies from df:

>>> df.copies.min()
1
>>> df = filter_by_overall_copies(df, 5)
>>> df.copies.min()
4
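
Because the cutoff applies to a clone's total copies across all pools, an individual row can still report fewer copies than the cutoff. The sketch below illustrates that reading with plain pandas; it is not the library's implementation, and the helper name keep_by_total_copies and the toy data are hypothetical.

```python
import pandas as pd

def keep_by_total_copies(df, copies, field='clone_id'):
    """Keep rows whose clone has at least `copies` copies summed over all rows."""
    totals = df.groupby(field)['copies'].transform('sum')
    return df[totals >= copies]

# Hypothetical data: clone 1 has 4 + 3 = 7 copies spread over two replicates.
toy = pd.DataFrame({
    'clone_id': [1, 1, 2, 3],
    'replicate_name': ['rep1', 'rep2', 'rep1', 'rep2'],
    'copies': [4, 3, 2, 9],
})

kept = keep_by_total_copies(toy, 5)
print(sorted(kept['clone_id'].unique().tolist()))
print(int(kept['copies'].min()))
```

Here clone 2 is removed because its total is 2, while both rows of clone 1 are kept even though each row alone is below 5.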

hicutils.core.filters.filter_by_presence(df, pool, pool_value)¶

Filters clones based on presence in a given pool.

Parameters¶

df (pd.DataFrame): The DataFrame to filter.
pool (str): The pool on which to filter.
pool_value (str): The pool value on which to filter.

Returns¶

DataFrame filtered by presence in pool.

hicutils.core.filters.filter_functional(df, functional=True)¶

Filters clones by functionality, by default removing non-functional clones. Setting functional to False removes functional clones instead.

Parameters¶

df (pd.DataFrame): The DataFrame to filter.
functional (bool): The functionality of the clones to include. Set to True (the default) to include functional clones only. Set to False to include non-functional clones only.

Returns¶

DataFrame filtered by functionality.

hicutils.core.filters.filter_number_of_pools(df, pool, n, func='greater_equal', limit_to=None)¶

Filters clones based on the number of pools in which they occur.

Parameters¶

df (pd.DataFrame): The DataFrame to filter.
pool (str): The pool on which to filter.
n (int): The number of distinct pools a clone must be in to be included in the resulting DataFrame.
func (function): The comparison function to use between n and the number of occurrences of each clone. The default is greater_equal, meaning a clone must occur in ≥ n pools to be included. Any numpy function may be used, such as equal or less_equal.
limit_to (list(str), str, None): If specified, overlap will be limited to the specified pools. This is useful to filter clones based on their overlap in a subset of pools.

Returns¶

DataFrame filtered by number of pools.
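
The counting logic can be sketched with plain pandas and numpy. This is an illustration only: the helper name keep_by_number_of_pools is hypothetical, func is assumed to be the name of a numpy comparison function (as the default 'greater_equal' suggests), and all rows of a passing clone are assumed to be returned. It does not attempt to reproduce the n = 0 exclusion case shown in the examples.

```python
import numpy as np
import pandas as pd

def keep_by_number_of_pools(df, pool, n, func='greater_equal', limit_to=None):
    """Keep rows of clones whose distinct-pool count satisfies func(count, n)."""
    if isinstance(limit_to, str):
        limit_to = [limit_to]
    # Only count occurrences in the pools of interest, if any were given.
    counted = df if limit_to is None else df[df[pool].isin(limit_to)]
    pools_per_clone = counted.groupby('clone_id')[pool].nunique()
    compare = getattr(np, func)  # e.g. np.greater_equal or np.equal
    passing = pools_per_clone[compare(pools_per_clone.to_numpy(), n)].index
    return df[df['clone_id'].isin(passing)]

# Hypothetical data: clone 1 is in two replicates, clone 3 in three.
reps = pd.DataFrame({
    'clone_id': [1, 1, 2, 3, 3, 3],
    'replicate_name': ['r1', 'r2', 'r1', 'r1', 'r2', 'r3'],
})

def clone_ids(frame):
    return sorted(frame['clone_id'].unique().tolist())

two_or_more = keep_by_number_of_pools(reps, 'replicate_name', 2)
exactly_two = keep_by_number_of_pools(reps, 'replicate_name', 2, func='equal')
in_r2_and_r3 = keep_by_number_of_pools(
    reps, 'replicate_name', 2, limit_to=['r2', 'r3'])
print(clone_ids(two_or_more), clone_ids(exactly_two), clone_ids(in_r2_and_r3))
```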

hicutils.core.filters.remove_potential_contaminates(df, pool, pool_values, clone_feature='cdr3_nt')¶

Removes clones based on clone_feature (defaults to CDR3 NT) which occur in pool with values pool_values. For example, to remove all clones with CDR3 NT sequences found in subjects ‘Fibroblast’ and ‘Water’:

remove_potential_contaminates(df, 'subject', ['Fibroblast', 'Water'])

Parameters¶

df (pd.DataFrame): The DataFrame to filter.
pool (str): The pool to use for filtering.
pool_values (list): The values of pool which should be the basis of clonal exclusion.
clone_feature (str): The clone feature to use for filtering. For example, cdr3_nt (the default) will use the CDR3 NT sequence as the basis for removing other clones.

Returns¶

DataFrame with clones occurring in pool with values pool_values excluded on the basis of clone_feature.
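
A minimal pandas sketch of this exclusion is shown below. It is an illustration rather than the library's implementation; the helper name drop_shared_features and the toy data are hypothetical.

```python
import pandas as pd

def drop_shared_features(df, pool, pool_values, clone_feature='cdr3_nt'):
    """Drop every row whose clone_feature value occurs in the flagged pools."""
    flagged = df.loc[df[pool].isin(pool_values), clone_feature].unique()
    return df[~df[clone_feature].isin(flagged)]

# Hypothetical data: 'CCC' appears in both a real subject and the water control.
samples = pd.DataFrame({
    'subject': ['S1', 'S1', 'Water', 'Fibroblast'],
    'cdr3_nt': ['AAA', 'CCC', 'CCC', 'GGG'],
})

cleaned = drop_shared_features(samples, 'subject', ['Fibroblast', 'Water'])
print(cleaned['cdr3_nt'].tolist())
```

In this sketch the control rows themselves are also removed, because their sequences are by definition in the flagged set.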
