Plotting

The hicutils.plots module provides all plotting functions. Each plotting function returns both a handle to the underlying figure and the pd.DataFrame that was used to create the plot.

Clone Size

A variety of clone size plots are provided to visualize the overall clonal landscape of a dataset.

clone_size
hicutils.plots.clone_size.plot_clone_counts(df, pool, **kwargs)

Plots the number of clones per pool.

Parameters

df : pd.DataFrame

The DataFrame used to plot the clone size distribution.

pool : str

The field on which to pool.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.

hicutils.plots.clone_size.plot_clone_sizes(df, cutoff=None, **kwargs)

Plots the distribution of clone sizes in df.

Parameters

df : pd.DataFrame

The DataFrame used to plot the clone size distribution.

cutoff : int or None

Aggregate all clones with cutoff or more copies into one bin on the right side of the graph. This is useful to condense the tail of the plotted distribution.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.
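
The effect of cutoff can be sketched without any plotting. The snippet below is only an illustration and not the hicutils implementation: bin_clone_sizes is a hypothetical helper, and it assumes that the size of a clone is its copy count.

```python
from collections import Counter


def bin_clone_sizes(sizes, cutoff=None):
    """Count clones per size, pooling sizes >= cutoff into one tail bin.

    Hypothetical helper for illustration; not part of hicutils.
    """
    counts = Counter()
    for size in sizes:
        if cutoff is not None and size >= cutoff:
            # All large clones share the right-most bin.
            counts[f"{cutoff}+"] += 1
        else:
            counts[size] += 1
    return dict(counts)


sizes = [1, 1, 1, 2, 2, 3, 7, 12, 40]
binned = bin_clone_sizes(sizes, cutoff=5)
unbinned = bin_clone_sizes(sizes)
```

With cutoff=5 the three clones of size 7, 12, and 40 collapse into a single "5+" bin, whereas without a cutoff each of them keeps its own bin.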

hicutils.plots.clone_size.plot_d_index(df, pool, cutoff=20, **kwargs)

Plots the Dx index for clones in df. The default cutoff value is 20 and the generated figure is a dot plot of Dx values stratified by pool.

Parameters

df : pd.DataFrame

The DataFrame used to plot the Dx index.

pool : str

The field on which to pool.

cutoff : int

The D-value to use as a cutoff, defaults to 20.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.
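
This page does not define the Dx index. The sketch below assumes the common definition in which Dx is the fraction of all copies contributed by the x largest clones; if hicutils computes it differently, the numbers will not match. The d_index helper and the example pool names are hypothetical.

```python
def d_index(copies, cutoff=20):
    """Fraction of total copies held by the `cutoff` largest clones.

    Assumed definition of Dx; not taken from the hicutils source.
    """
    total = sum(copies)
    if total == 0:
        return 0.0
    top = sorted(copies, reverse=True)[:cutoff]
    return sum(top) / total


# Copies per clone, keyed by a hypothetical pool name.
pool_copies = {
    "blood": [50, 30, 10, 5, 5],
    "spleen": [10, 10, 10, 10, 10],
}
d2 = {pool: d_index(copies, cutoff=2) for pool, copies in pool_copies.items()}
```

Under this definition a pool dominated by a few large clones (blood above) has a higher value than an evenly distributed pool (spleen).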

hicutils.plots.clone_size.plot_top_clones(df, cutoff=20, annotate=False, color=(0.8392156862745098, 0.15294117647058825, 0.1568627450980392), figsize=(12, 8))

Plots the copy-number frequency of the top cutoff clones (default 20). Optionally, the annotate keyword can be set to one or more clone features to annotate each bar. For example, setting annotate=('v_gene', 'cdr3_aa') will show the V-gene and CDR3 AA for each clone.

Parameters

df : pd.DataFrame

The DataFrame used to plot the top clones.

cutoff : int

The number of clones to plot, defaults to 20.

annotate : str, list, or None

The feature(s) to annotate for each clone.

color : str

The color to use for bars.

figsize : tuple

The (width, height) of the plot.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.

Gene Usage

The gene usage plots show V- or J-gene usage grouped by pool. This can be useful for investigating gene skewing in different populations. Each plot can be scaled in various ways and clustered by row, column, both, or neither.

gene_usage
hicutils.plots.gene_usage.plot_gene_frequency(df, pool, gene, size_metric='clones', by=None, **kwargs)

Generates a gene-usage dot/bar plot showing the utilization of each V or J gene based on pools.

Parameters

df : pd.DataFrame

The DataFrame to use as the source of gene usage information.

pool : str

The pooling column to use for the plot.

gene : str (v_gene or j_gene)

The gene to plot. Must be either v_gene or j_gene.

size_metric : str

The size metric to plot. Must be one of clones, copies, or uniques.

by : str

The feature to use as the hue variable for the plot. Must be included in the pool parameter.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.

hicutils.plots.gene_usage.plot_gene_heatmap(df, pool, gene, min_frequency=0, size_metric='clones', normalize_by='rows', cluster_by='both', **kwargs)

Generates a gene-usage heatmap showing the utilization of each V or J gene based on pools.

Parameters

df : pd.DataFrame

The DataFrame to use as the source of gene usage information.

pool : str

The pooling column to use for each row of the heatmap.

gene : str (v_gene or j_gene)

The gene to plot. Must be either v_gene or j_gene.

min_frequency : float

The minimum frequency, across all pools, required for a gene to be included in the heatmap.

size_metric : str

The size metric which is plotted as the intensity of each cell. Must be one of clones, copies, or uniques.

normalize_by : str

Sets how to normalize the plot. If set to rows (the default), each row is normalized to sum to one. Setting it to cols causes each column (gene) to sum to one.

cluster_by : str (rows, cols, or both) or None

Sets which clustering to display. Valid values are rows, cols, and both; clustering can be disabled with None.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.
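
The difference between normalize_by='rows' and normalize_by='cols' can be sketched on a small count matrix. This is a plain-Python illustration with a hypothetical normalize helper, not the code hicutils uses internally.

```python
def normalize(matrix, by="rows"):
    """Scale a list-of-lists count matrix so each row or column sums to one.

    Hypothetical helper for illustration; not part of hicutils.
    """
    if by == "rows":
        result = []
        for row in matrix:
            total = sum(row)
            result.append([v / total if total else 0.0 for v in row])
        return result
    if by == "cols":
        col_sums = [sum(col) for col in zip(*matrix)]
        return [
            [v / s if s else 0.0 for v, s in zip(row, col_sums)]
            for row in matrix
        ]
    raise ValueError("by must be 'rows' or 'cols'")


# Rows are pools, columns are genes.
counts = [[2, 6], [1, 1]]
by_rows = normalize(counts, "rows")
by_cols = normalize(counts, "cols")
```

Row normalization makes pools comparable regardless of their total size, while column normalization shows how each gene is distributed across pools.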

Clonal Overlap

Clonal overlap can be visualized by string plots with the plot_strings function or as UpSet plots with the plot_upset function.

For string plots, each row represents a clone and each column a pool. The frequency of a given clone in a pool can be indicated by the intensity of the corresponding cell if desired. Further, the features that define a clone can be modified by the overlapping_features parameter. For example, to track clonal CDR3 amino acids rather than clone_id, one can specify overlapping_features=['cdr3_aa'].

UpSet plots are an extension of Venn diagrams to show large numbers of categories. These can be plotted using the plot_upset function and are highly configurable.

See the API documentation for all parameters of these functions.

overlap
hicutils.plots.overlap.plot_similarity_heatmap(df, pool, dist_func_name, clone_features='clone_id', cutoff_func=None, **kwargs)

Generates a heatmap of the pairwise similarity between pools, calculated with the function named by dist_func_name (jaccard or cosine). The definition of a clone can be varied with the clone_features parameter, and the maximum plotted value can be capped with cutoff_func.

Parameters

df : pd.DataFrame

The DataFrame to use as the source of clonal overlap information.

pool : str

How to pool the clones to calculate similarity.

dist_func_name : function

Function to use for similarity calculation. Accepts jaccard or cosine.

clone_features : list(str)

The feature(s) to use for clone definition. The default clone_id uses the clone definitions in df. This can be altered to any other columns in the DataFrame such as cdr3_aa.

cutoff_func : func(df) -> float

A function returning a cutoff to designate the maximum value in the DataFrame. All values greater than or equal to the returned value are remapped to the returned value.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying overlap DataFrame.
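
The two accepted similarity measures can be sketched as follows. This is an illustration under assumptions, not the hicutils implementation: Jaccard is computed here on the sets of clones present in each pool and cosine on per-clone copy counts, and the helper names jaccard, cosine, and cap are hypothetical. The cap function mirrors the remapping described for cutoff_func.

```python
import math


def jaccard(a, b):
    """Shared clones divided by all clones seen in either pool."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity of two {clone: copies} mappings."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cap(values, cutoff):
    """Remap every value at or above cutoff to cutoff."""
    return [min(v, cutoff) for v in values]


# Copies per clone in two hypothetical pools.
pools = {
    "A": {"c1": 3, "c2": 4},
    "B": {"c2": 4, "c3": 3},
}
j = jaccard(pools["A"], pools["B"])
c = cosine(pools["A"], pools["B"])
capped = cap([0.1, 0.5, 0.9], cutoff=0.5)
```

Jaccard ignores clone size entirely, so a single shared clone out of three gives 1/3 here, while cosine weights the shared clone by its copies in both pools.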

hicutils.plots.overlap.plot_strings(df, pool, only_overlapping=True, overlapping_features=('clone_id', 'cdr3_aa', 'v_gene', 'j_gene'), scale=False, limit=None, ylabels='counts', col_order=None, row_order=None, order=None, pivot_hook=None, col_namer=<function <lambda>>, highlight=None, **kwargs)

Creates an overlap string plot where each row represents a clone and each column represents a pool. Among other features, the definition of a clone can be modified and the heatmap can be boolean or scaled to the number of copies a clone comprises in each pool.

Parameters

df : pd.DataFrame

The DataFrame to use for tracking clones.

pool : str

The column to use for pooling clones into columns.

only_overlapping : bool

If set to True (the default), only clones overlapping at least two pools will be included in the overlap plot.

overlapping_features : list

The feature(s) to use to track clones across pools. By default, the combination of clone_id, cdr3_aa, v_gene, and j_gene is used, as shown in the signature above. To alter this behavior, this value can be changed to any clonal information fields such as cdr3_aa, v_gene, j_gene, and cdr3_nt.

This is particularly useful to track clones across donors where the clone_id will differ but the cdr3_aa can be used instead.

scale : bool or log

If scale=False (the default), presence of a clone in a pool is indicated by blue and absence by gray. When scale=True, the color of each clone/pool indicates the total number of copies. Setting scale='log' changes the scale to be the log10 of copies.

limit : int or None

If set to an integer n, limits the number of clones to the top n.

ylabels : counts or full

If set to counts (the default), y-axis ticks will be shown indicating the number of clones in the plot. If set to full, all features in overlapping_features will be shown for each row.

col_order : function or None

A function that is passed the pd.DataFrame and shall return a list of columns in the desired order.

row_order : function or None

A function that is passed the pd.DataFrame and shall return a list of row indexes.

pivot_hook : function or None

A function to call on the pivoted table. Useful for filtering sequences based on their frequency across pools.

col_namer : function

A function to rename columns. The function should accept a tuple and return a formatted string version.

highlight : list or function

A list of two-value tuples in the format [(color_hex, [indices]), …] to highlight. Each item in the list specifies a color to use and the row indices to highlight with the color. The highlights are applied in order, so row indices which occur multiple times are colored by the last item in the list.

The indices should match the format specified in overlapping_features.

Alternatively, a function can be passed which returns a list formatted as described above and shown below.

For example, the following will color the CDR3 CARAFDHW in red and CARESLRFMDVW in green:

[
    ('#ff0000', ['CARAFDHW']),
    ('#00ff00', ['CARESLRFMDVW']),
]

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.
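
The ordering rule for highlight, in which a row index listed more than once takes the color of the last item that names it, can be sketched as a simple dictionary update. The resolve_highlights helper is hypothetical and only illustrates the documented rule; it is not how hicutils necessarily applies the colors.

```python
def resolve_highlights(highlight):
    """Map each row index to its final color; later items override earlier ones.

    Hypothetical helper for illustration; not part of hicutils.
    """
    colors = {}
    for color, indices in highlight:
        for index in indices:
            colors[index] = color
    return colors


highlight = [
    ("#ff0000", ["CARAFDHW", "CARESLRFMDVW"]),
    ("#00ff00", ["CARESLRFMDVW"]),
]
row_colors = resolve_highlights(highlight)
```

Here CARESLRFMDVW appears in both items, so it ends up green, while CARAFDHW stays red.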

hicutils.plots.overlap.plot_upset(df, pool, size='clones', clone_features=['clone_id'], subplots=(), subplot_kind='violin', **kwargs)

Generates an UpSet plot of clonal data. The UpSet plot may be scaled by clones or copies with size and the definition of a clone can be varied with the clone_features parameter. Further, distributions of other variables such as cdr3_num_nts and shm can be placed above each intersection bar with subplots.

Parameters

df : pd.DataFrame

The DataFrame to use as the source of clonal overlap information.

pool : str

How to pool the clones to calculate overlap. Each pool value will be treated as a category in the UpSet plot.

size : str, clones or copies

The number to use as the cardinality of overlap sizes.

clone_features : list(str)

The feature(s) to use for clone definition. The default clone_id uses the clone definitions in df. This can be altered to any other columns in the DataFrame such as cdr3_aa to track clones across subjects.

subplots : list(str)

Features to plot as sns.catplot plots above each intersection bar. Valid options are shm and cdr3_num_nts.

subplot_kind : str

The kind of plot to use for subplots. Any valid sns.catplot type is allowed (e.g. box, violin).

kwargs : dict

Other parameters to pass to usp.UpSet.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying overlap DataFrame.
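
The quantity an UpSet plot displays, the number of clones found in each exact combination of pools, can be sketched in plain Python. This corresponds to counting by clones rather than copies; intersection_sizes and the example records are hypothetical and do not reflect how hicutils or the UpSet plotting library build their tables.

```python
from collections import Counter, defaultdict


def intersection_sizes(records):
    """Count clones per exact set of pools they occur in.

    records is an iterable of (clone, pool) pairs.
    Hypothetical helper for illustration; not part of hicutils.
    """
    membership = defaultdict(set)
    for clone, pool in records:
        membership[clone].add(pool)
    return Counter(frozenset(pools) for pools in membership.values())


records = [
    ("c1", "blood"), ("c1", "spleen"),
    ("c2", "blood"),
    ("c3", "spleen"),
    ("c4", "blood"), ("c4", "spleen"), ("c4", "lung"),
]
sizes = intersection_sizes(records)
```

Each key of sizes is one intersection bar: a clone found in blood and spleen only is counted separately from a clone found in blood, spleen, and lung.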

Somatic hypermutation (SHM)

The somatic hypermutation (SHM) of a dataset can be plotted in a variety of ways including as a distribution, bar/violin plots, and as a range plot.

shm
hicutils.plots.shm.plot_most_mutated_pie(df, pool, colors, **kwargs)

Plots the most mutated pool in df as a pie chart.

Parameters

df : pd.DataFrame

The DataFrame used to plot the SHM.

pool : str

The pool to use for plotting.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.

hicutils.plots.shm.plot_mutated_fraction(df, pool, threshold=2.0, **kwargs)

Plots the fraction of clones with greater than threshold SHM in each pool.

Parameters

df : pd.DataFrame

The DataFrame used to plot the SHM.

pool : str

The pool to use for plotting.

threshold : float

The SHM percentage threshold to use to determine if a clone is mutated.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.
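
The fraction being plotted can be sketched as follows, taking "greater than threshold" literally so that a clone at exactly the threshold counts as unmutated. The mutated_fraction helper, the pool names, and the SHM values are hypothetical; this is not the hicutils implementation.

```python
def mutated_fraction(clones, threshold=2.0):
    """Return {pool: fraction of clones with SHM strictly above threshold}.

    clones is an iterable of (pool, shm_percent) pairs.
    Hypothetical helper for illustration; not part of hicutils.
    """
    totals = {}
    mutated = {}
    for pool, shm in clones:
        totals[pool] = totals.get(pool, 0) + 1
        mutated[pool] = mutated.get(pool, 0) + (1 if shm > threshold else 0)
    return {pool: mutated[pool] / totals[pool] for pool in totals}


clones = [
    ("naive", 0.0), ("naive", 0.5), ("naive", 3.1), ("naive", 2.0),
    ("memory", 6.2), ("memory", 1.0),
]
fractions = mutated_fraction(clones)
```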

hicutils.plots.shm.plot_shm_aggregate(df, pool, **kwargs)

Categorically plots the SHM of each pool.

Parameters

df : pd.DataFrame

The DataFrame used to plot the SHM.

pool : str

The pool to use for plotting.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.

hicutils.plots.shm.plot_shm_distribution(df, pool, size_metric, palette=None, hue_order=None, **kwargs)

Plots the SHM distribution of a pooled DataFrame using either clones, copies, or uniques as a size metric.

Parameters

df : pd.DataFrame

The DataFrame used to plot the SHM distribution.

pool : str

The pool to use for plotting.

size_metric : str

The metric to determine each clone's size. Must be clones, copies, or uniques.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.

hicutils.plots.shm.plot_shm_range(df, pool, buckets=(1, 10, 25), order=None, **kwargs)

Plots the range of clonal SHM for each pool.

Parameters

df : pd.DataFrame

The DataFrame used to plot the SHM.

pool : str

The pool to use for plotting.

buckets : list(int)

A list of cut-points to bin SHM. The default is (1, 10, 25), meaning clones will be stratified by SHM into the buckets [1, 10), [10, 25), and 25+. All intervals are left-closed; that is, the lesser value in each interval is inclusive and the greater value is exclusive.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.
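
The left-closed bucketing can be sketched with the standard bisect module. The shm_bucket helper is hypothetical, and the handling of values below the first cut-point (returning None here) is an assumption, since this page does not say how hicutils treats them.

```python
from bisect import bisect_right


def shm_bucket(shm, buckets=(1, 10, 25)):
    """Return the label of the left-closed bucket that contains shm.

    Returns None for values below the first cut-point (assumed behavior).
    Hypothetical helper for illustration; not part of hicutils.
    """
    # bisect_right places a value equal to a cut-point in the bucket
    # that starts at that cut-point, which makes intervals left-closed.
    position = bisect_right(buckets, shm)
    if position == 0:
        return None
    lower = buckets[position - 1]
    if position == len(buckets):
        return f"{lower}+"
    return f"[{lower}, {buckets[position]})"


labels = [shm_bucket(v) for v in (0.5, 1, 9.99, 10, 25, 40)]
```

A value of exactly 10 lands in [10, 25) rather than [1, 10), which is the inclusive-lower, exclusive-upper behavior described for buckets.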

CDR3 analysis

A number of CDR3 analysis plots are provided, including CDR3 amino-acid usage both as a heatmap and as logo plots. Additionally, CDR3 spectratypes can be created to show the CDR3 length distribution and highlight the top copy clones.

cdr3_analysis
hicutils.plots.cdr3_analysis.plot_cdr3_aa_usage(df, pool, size_metric='clones', normalize_by='rows', cluster_by='both', figsize=(20, 10))

Plots CDR3 amino-acid usage separated by pool.

Parameters

df : pd.DataFrame

The DataFrame to use as the source of CDR3 amino-acid usage information.

pool : str

The pooling column to use for each row of the heatmap.

size_metric : str

The size metric which is plotted as the intensity of each cell. Must be one of clones, copies, or uniques.

normalize_by : str

Sets how to normalize the plot. If set to rows (the default), each row is normalized to sum to one. Setting it to cols causes each column (amino-acid) to sum to one.

cluster_by : str (rows, cols, or both) or None

Sets which clustering to display. Valid values are rows, cols, and both; clustering can be disabled with None.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.

hicutils.plots.cdr3_analysis.plot_cdr3_distribution(df, pool, size_metric='clones', **kwargs)

Plots CDR3 length distribution.

Parameters

df : pd.DataFrame

The DataFrame to use for plotting CDR3 length.

pool : str

The pooling column to use for hue value.

size_metric : str

The size metric to use as the height for each bar.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.

Creates a logo plot for CDR3 strings of a given length either by amino-acid or nucleotide.

Parameters

df : pd.DataFrame

The DataFrame to use as the source of CDR3 information.

by : str

Either cdr3_aa to plot amino-acids or cdr3_nt to plot nucleotides.

length : int

The length of CDR3s to plot. Interpreted as the length of by.

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.
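
The data behind a logo plot for a fixed CDR3 length is a table of per-position residue frequencies. The sketch below shows one way to derive such a table from a list of CDR3 strings; position_frequencies is a hypothetical helper, and the logo plot in hicutils may weight or transform the counts differently (for example by information content).

```python
from collections import Counter


def position_frequencies(sequences, length):
    """Per-position residue frequencies for sequences of exactly `length`.

    Hypothetical helper for illustration; not part of hicutils.
    """
    selected = [s for s in sequences if len(s) == length]
    if not selected:
        return []
    frequencies = []
    # zip(*selected) yields one tuple of residues per position.
    for column in zip(*selected):
        counts = Counter(column)
        frequencies.append(
            {residue: n / len(selected) for residue, n in counts.items()}
        )
    return frequencies


cdr3s = ["CARAFDHW", "CARGFDYW", "CARESLRFMDVW", "CARAFDYW"]
freqs = position_frequencies(cdr3s, length=8)
```

Only the three CDR3s of length 8 contribute; the 12-residue CARESLRFMDVW is excluded because a logo needs aligned strings of equal length.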

hicutils.plots.cdr3_analysis.plot_cdr3_spectratype(df, color_top=10, **kwargs)

Plots CDR3 length while annotating and highlighting the top color_top clones.

Parameters

df : pd.DataFrame

The DataFrame to use for plotting CDR3 length.

color_top : int

The number of clones to highlight (default 10).

Returns

A tuple (g, df) where g is a handle to the plot and df is the underlying DataFrame.