Loading Data¶
There are multiple ways to load AIRR-seq data in hicutils:
(Recommended) Using existing un-pooled AIRR-formatted files together with a metadata file that has one row per file.
Using existing pooled AIRR-formatted files exported from ImmuneDB, where pooling metadata is embedded in the file names.
Directly downloading and loading data from a hosted ImmuneDB instance using its URL and database name.
Examples¶
import hicutils as hu
Importing from existing replicate files and associated metadata file¶
You can load individual replicate files as long as there is an associated metadata file containing a replicate_name column followed by the metadata columns for each file. For example, given the following files:
%%bash
ls example_data_immunedb
HPAP015.IgH_HPAP015_rep1_200p0ng.pooled.tsv HPAP015.IgH_HPAP015_rep2_200p0ng.pooled.tsv HPAP017.IgH_HPAP017_rep1_200p0ng.pooled.tsv HPAP017.IgH_HPAP017_rep2_200p0ng.pooled.tsv metadata.tsv
Given this, to load this directory run:
hu.io.read_directory('example_data_immunedb')
| | clone_id | subject | v_gene | j_gene | functional | insertions | deletions | cdr3_nt | cdr3_num_nts | cdr3_aa | ... | METADATA_sequencing_date | METADATA_sequencing_type | METADATA_species | METADATA_umi | METADATA_gad | METADATA_ia2 | METADATA_iaa | METADATA_znt8 | METADATA_date_hic_received | METADATA_collapse_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8248 | 6311533 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCACACAGCTGGGTACGGTATAACAGTGGCTGGGGCTTTCACT... | 51 | CAHSWVRYNSGWGFHYW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| 9810 | 6326493 | HPAP015 | IGHV2-70\|2-70D | IGHJ4 | T | NaN | NaN | TGTGCACGGCCCCATGGCAGCAGTGGCTGGTACTACTTTGACTACTGG | 48 | CARPHGSSGWYYFDYW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| 8697 | 6315829 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCCAGGGGCCAGTGGCTGGCACCGAACCACTTTGACTACTGG | 45 | CARGQWLAPNHFDYW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| 7970 | 6308963 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCACACAGGGGCAGCAGCTGGGACTACTGG | 33 | CAHRGSSWDYW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| 8549 | 6314347 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCGCACAGTACGATACGATTTCAGTACTACTTTGACTCCTGG | 45 | CAHSTIRFQYYFDSW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4137 | 7029341 | HPAP017 | IGHV1-46 | IGHJ3 | T | NaN | NaN | TGTGCGGCAGTTCGTTACTATGATAGTAGTGGTTATTTTGCTGCCG... | 87 | CAAVRYYDSSGYFAAGDSDYGRAGAFDIW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
| 4135 | 7029336 | HPAP017 | IGHV1-46 | IGHJ3 | T | NaN | NaN | TGTGCGGCAGCAAATTACTATGATAGNAGTGGTTATTACCACTATG... | 60 | CAAANYYDXSGYYHYAFDIW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
| 4134 | 7029309 | HPAP017 | IGHV1-46 | IGHJ3 | T | NaN | NaN | TGTGCGAGAGATCTCTATGATAGTATTGGTTATTACCGGGCCGANG... | 60 | CARDLYDSIGYYRAXAFDIW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
| 4133 | 7029295 | HPAP017 | IGHV1-46 | IGHJ3 | T | NaN | NaN | NGTGCGAGAGACAAGTATAGTGGGAGCTACTACTTGTCCGATGCTT... | 57 | XARDKYSGSYYLSDAFDIW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
| 9999 | 7116522 | HPAP017 | IGHV3-11 | IGHJ6 | T | NaN | NaN | TGTGCGAGAGCCTACAGCTATGGCCAATACTACTACTACGGTATGG... | 54 | CARAYSYGQYYYYGMDVW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
40000 rows × 48 columns
Importing from existing files with metadata in filenames¶
Alternatively, if you have existing files which were exported from ImmuneDB (either using immunedb_export ... clones ... or via the website), they can be imported directly. Take for example the files below:
%%bash
ls example_data_meta_in_names
HPAP015.T1D.pooled.tsv HPAP017.Control.pooled.tsv
The files can be imported with the following:
# Specify that the metadata in the filename is the disease status.
# If there are multiple features separated by the _AND_ string,
# per the ImmuneDB specification, the second parameter should
# be a list of all features (e.g. for age and disease, ['age', 'disease']).
hu.io.read_tsvs('example_data_meta_in_names', ['disease'])
| | clone_id | subject | v_gene | j_gene | functional | insertions | deletions | cdr3_nt | cdr3_num_nts | cdr3_aa | ... | copies | germline | parent_id | avg_v_identity | top_copy_seq | copies_fraction | copies_percent | shm | clones | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16548 | 6310562 | HPAP015 | IGHV2-5 | IGHJ4\|5 | T | NaN | NaN | TGTGCACGTGCGCGGGGGGCTTATTGG | 27 | CARARGAYW | ... | 41 | CAGGTCACCTTGAAGGAGTCTGGTCCT---GCGCTGGTGAAACCCA... | NaN | 0.955849 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000953 | 0.095316 | 4.415122 | 1 | T1D |
| 16771 | 6311533 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCACACAGCTGGGTACGGTATAACAGTGGCTGGGGCTTTCACT... | 51 | CAHSWVRYNSGWGFHYW | ... | 34 | CAGATCACCTTGAAGGAGTCTGGTCCT---ACGCTGGTGAAACCCA... | NaN | 0.988600 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000790 | 0.079042 | 1.140000 | 1 | T1D |
| 19430 | 6326493 | HPAP015 | IGHV2-70\|2-70D | IGHJ4 | T | NaN | NaN | TGTGCACGGCCCCATGGCAGCAGTGGCTGGTACTACTTTGACTACTGG | 48 | CARPHGSSGWYYFDYW | ... | 31 | CAGGTCACCTTGAAGGAGTCTGGTCCT---GCGCTGGTGAAACCCA... | NaN | 0.956600 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000721 | 0.072068 | 4.340000 | 1 | T1D |
| 17713 | 6315829 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCCAGGGGCCAGTGGCTGGCACCGAACCACTTTGACTACTGG | 45 | CARGQWLAPNHFDYW | ... | 30 | CAGATCACCTTGAAGGAGTCTGGTCCT---ACGCTGGTGAAACCCA... | NaN | 0.953710 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000697 | 0.069743 | 4.629000 | 1 | T1D |
| 7648 | 6262779 | HPAP015 | IGHV1-3 | IGHJ4 | T | NaN | NaN | TGTGCGAGAGCCGTGGAGAATCATTTTGACTGGTTAAGTAACTACTGG | 48 | CARAVENHFDWLSNYW | ... | 30 | CAGGTCCAGCTTGTGCAGTCTGGGGCT---GAGGTGAAGAAGCCTG... | NaN | 0.940033 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000697 | 0.069743 | 5.996667 | 1 | T1D |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8487 | 7016857 | HPAP017 | IGHV1-3 | IGHJ3 | F | NaN | NaN | TGNNCGAGACAGGGTGCGTAGCAGTGGCTGGTACTGTGGGGGGGGG... | 63 | XXRQGA*QWLVLWGGDAFDIW | ... | 1 | CAGGTCCAGCTTGTGCAGTCTGGGGCT---GAGGTGAAGAAGCCTG... | NaN | 0.967300 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000020 | 0.001979 | 3.270000 | 1 | Control |
| 8488 | 7016859 | HPAP017 | IGHV1-3 | IGHJ3 | T | NaN | NaN | TGTGCGAGAGTCATGGTGGGTTATAGTGGCTACGGAGGTNNCTACG... | 75 | CARVMVGYSGYGGXYXVSGYAFDIW | ... | 1 | CAGGTCCAGCTTGTGCAGTCTGGGGCT---GAGGTGAAGAAGCCTG... | NaN | 0.972100 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000020 | 0.001979 | 2.790000 | 1 | Control |
| 8492 | 7016881 | HPAP017 | IGHV1-3 | IGHJ3 | T | NaN | NaN | TGTGCGAGAGGGGGTTNTCGGCAGAGGGTGGCGAATTACTNTGGTT... | 72 | CARGGXRQRVANYXGSGRGAFDIW | ... | 1 | CAGGTCCAGCTTGTGCAGTCTGGGGCT---GAGGTGAAGAAGCCTG... | NaN | 0.958100 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000020 | 0.001979 | 4.190000 | 1 | Control |
| 8493 | 7016885 | HPAP017 | IGHV1-3 | IGHJ3 | T | NaN | NaN | TGTGCGAGAGTATCCAGCTATGGTTGGGAAAGTGCAGGGCCTGATG... | 60 | CARVSSYGWESAGPDAFDXW | ... | 1 | CAGGTCCAGCTTGTGCAGTCTGGGGCT---GAGGTGAAGAAGCCTG... | NaN | 0.953500 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000020 | 0.001979 | 4.650000 | 1 | Control |
| 19892 | 7116522 | HPAP017 | IGHV3-11 | IGHJ6 | T | NaN | NaN | TGTGCGAGAGCCTACAGCTATGGCCAATACTACTACTACGGTATGG... | 54 | CARAYSYGQYYYYGMDVW | ... | 1 | CAGGTGCAGCTGGTGGAGTCTGGGGGA---GGCTTGGTCAAGCCTG... | NaN | 0.930200 | NNNNNNNNNNNNNNNNNNNNNNNNNNN---NNNNNNNNNNNNNNNN... | 0.000020 | 0.001979 | 6.980000 | 1 | Control |
39513 rows × 22 columns
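To make the filename convention above concrete, here is a minimal sketch of how a name such as HPAP015.T1D.pooled.tsv might be split into a subject and its features. It is not the hicutils implementation: the helper name `parse_pooled_filename` is hypothetical, and the layout `<subject>.<feature values joined by _AND_>.pooled.tsv` is an assumption inferred from the example file names.

```python
from pathlib import Path

SUFFIX = '.pooled.tsv'
SEPARATOR = '_AND_'


def parse_pooled_filename(filename, features):
    """Split a pooled file name into subject and feature values.

    Hypothetical helper; assumes the layout
    <subject>.<value1>_AND_<value2>....pooled.tsv
    """
    name = Path(filename).name
    if not name.endswith(SUFFIX):
        raise ValueError(f'{name} does not end with {SUFFIX}')
    stem = name[:-len(SUFFIX)]
    # Everything before the first dot is treated as the subject.
    subject, _, encoded = stem.partition('.')
    values = encoded.split(SEPARATOR) if encoded else []
    if len(values) != len(features):
        raise ValueError(
            f'expected {len(features)} feature(s) in {name}, '
            f'found {len(values)}'
        )
    return {'subject': subject, **dict(zip(features, values))}


print(parse_pooled_filename('HPAP015.T1D.pooled.tsv', ['disease']))
```

The order of the `features` list matters in this sketch: each name is paired with the value at the same position in the file name.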
Downloading from an ImmuneDB Link (Slow for large datasets)¶
For a hosted ImmuneDB instance, you can directly download and load data from the website link. Depending on the database size, initially gathering the data may take some time. After it is downloaded, the cached version will be used unless the data is explicitly deleted.
hu.io.pull_immunedb_data(
'https://myurl.com/immunedb',
'mydb',
'example_data_immunedb'
)
| | clone_id | subject | v_gene | j_gene | functional | insertions | deletions | cdr3_nt | cdr3_num_nts | cdr3_aa | ... | METADATA_sequencing_date | METADATA_sequencing_type | METADATA_species | METADATA_umi | METADATA_gad | METADATA_ia2 | METADATA_iaa | METADATA_znt8 | METADATA_date_hic_received | METADATA_collapse_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8248 | 6311533 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCACACAGCTGGGTACGGTATAACAGTGGCTGGGGCTTTCACT... | 51 | CAHSWVRYNSGWGFHYW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| 9810 | 6326493 | HPAP015 | IGHV2-70\|2-70D | IGHJ4 | T | NaN | NaN | TGTGCACGGCCCCATGGCAGCAGTGGCTGGTACTACTTTGACTACTGG | 48 | CARPHGSSGWYYFDYW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| 8697 | 6315829 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCCAGGGGCCAGTGGCTGGCACCGAACCACTTTGACTACTGG | 45 | CARGQWLAPNHFDYW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| 7970 | 6308963 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCACACAGGGGCAGCAGCTGGGACTACTGG | 33 | CAHRGSSWDYW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| 8549 | 6314347 | HPAP015 | IGHV2-5 | IGHJ4 | T | NaN | NaN | TGTGCGCACAGTACGATACGATTTCAGTACTACTTTGACTCCTGG | 45 | CAHSTIRFQYYFDSW | ... | 2019-08-05 | Bulk | human | 0 | neg | neg | pos | neg | 2019-07-31 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4137 | 7029341 | HPAP017 | IGHV1-46 | IGHJ3 | T | NaN | NaN | TGTGCGGCAGTTCGTTACTATGATAGTAGTGGTTATTTTGCTGCCG... | 87 | CAAVRYYDSSGYFAAGDSDYGRAGAFDIW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
| 4135 | 7029336 | HPAP017 | IGHV1-46 | IGHJ3 | T | NaN | NaN | TGTGCGGCAGCAAATTACTATGATAGNAGTGGTTATTACCACTATG... | 60 | CAAANYYDXSGYYHYAFDIW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
| 4134 | 7029309 | HPAP017 | IGHV1-46 | IGHJ3 | T | NaN | NaN | TGTGCGAGAGATCTCTATGATAGTATTGGTTATTACCGGGCCGANG... | 60 | CARDLYDSIGYYRAXAFDIW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
| 4133 | 7029295 | HPAP017 | IGHV1-46 | IGHJ3 | T | NaN | NaN | NGTGCGAGAGACAAGTATAGTGGGAGCTACTACTTGTCCGATGCTT... | 57 | XARDKYSGSYYLSDAFDIW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
| 9999 | 7116522 | HPAP017 | IGHV3-11 | IGHJ6 | T | NaN | NaN | TGTGCGAGAGCCTACAGCTATGGCCAATACTACTACTACGGTATGG... | 54 | CARAYSYGQYYYYGMDVW | ... | 2019-08-05 | Bulk | human | 0 | pos | neg | neg | neg | 2019-07-31 | NaN |
40000 rows × 48 columns
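The download-once behaviour described above can be pictured with a generic caching pattern. The sketch below only illustrates the idea and does not reflect how hicutils implements it: `load_with_cache` and the `download` callable are hypothetical names, and treating a non-empty directory as "already cached" is an assumption.

```python
from pathlib import Path


def load_with_cache(out_dir, download):
    """Run download(out_dir) only when out_dir holds no data yet.

    Hypothetical helper. Returns (path, fetched), where fetched is
    True if the download callable was invoked on this call.
    """
    out_dir = Path(out_dir)
    # Assumption: a non-empty directory means the data is cached.
    if out_dir.is_dir() and any(out_dir.iterdir()):
        return out_dir, False
    out_dir.mkdir(parents=True, exist_ok=True)
    download(out_dir)
    return out_dir, True
```

Under this pattern, forcing a fresh download amounts to deleting the output directory (for example with `shutil.rmtree`) before calling the loader again.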
API Documentation¶
- hicutils.core.io.pull_immunedb_data(endpoint, db_name, out_name, skip_existing=True)¶
Downloads unpooled clonal data from an ImmuneDB instance.
Parameters¶
- endpoint : str
The endpoint to the hosted ImmuneDB instance. For example https://mydomain.com/immunedb.
- db_name : str
The database name itself. For example my_db.
- out_name : str
The name of the directory into which the data will be saved.
Returns¶
A pd.DataFrame with all clonal data downloaded from the ImmuneDB instance.
Examples¶
>>> io.pull_immunedb_data('https://mydomain.com/immunedb', 'my_db', 'my_db_data')
- hicutils.core.io.read_directory(path)¶
Reads AIRR-formatted TSV files and joins them with an associated metadata.tsv file to return a unified pd.DataFrame.
Parameters¶
- path : str
Path to the directory containing the AIRR-formatted files and metadata.tsv.
Returns¶
pd.DataFrame with AIRR-seq data and metadata.
- hicutils.core.io.read_metadata(path)¶
Reads a metadata file into a pd.DataFrame, prefixing METADATA_ to each field and setting the replicate_name to its index.
Parameters¶
- path : str
Path to the metadata file.
Returns¶
pd.DataFrame containing the metadata.
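The behaviour described for read_metadata can be approximated with plain pandas, which may help when checking a metadata file by hand. This is a sketch under assumptions, not the library's code: `read_metadata_sketch` is a hypothetical name, the file is assumed to be tab-separated, and the replicate names and metadata columns in the example are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical metadata.tsv contents. Only the replicate_name column
# is named in the documentation; everything else is illustrative.
EXAMPLE_TSV = (
    'replicate_name\tspecies\tsequencing_type\n'
    'HPAP015_rep1\thuman\tBulk\n'
    'HPAP015_rep2\thuman\tBulk\n'
)


def read_metadata_sketch(source):
    """Approximate the documented behaviour with plain pandas."""
    meta = pd.read_csv(source, sep='\t')
    # Index by replicate, then prefix the remaining columns.
    return meta.set_index('replicate_name').add_prefix('METADATA_')


meta = read_metadata_sketch(io.StringIO(EXAMPLE_TSV))
print(meta)
```

The resulting frame is indexed by replicate_name, and every other column carries the METADATA_ prefix seen in the output tables earlier on this page.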
- hicutils.core.io.read_tsvs(path, features=())¶
Reads AIRR-formatted input files into a single DataFrame and populates common fields.
Parameters¶
- path : str
Path to the directory containing .pooled.tsv files.
- features : list, optional
List of features which are encoded in the file names.
Returns¶
Single DataFrame containing the concatenated AIRR-formatted data.
- hicutils.core.io.save_fig_and_data(name, df, path='./', ext='pdf', fig_args=None, **kwargs)¶
Saves the most recently generated figure and associated data to files.
Parameters¶
- name : str
The filename to use for both the figure and data file.
- df : pd.DataFrame
The DataFrame used to generate the figure.
- path : str, optional
Path to directory into which the files should be saved.
- ext : str, optional
The extension of the figure file. Defaults to pdf but can be any image format such as png.
- fig_args : dict
Additional parameters which will be passed to plt.savefig.
- kwargs : dict
Additional parameters which will be passed to df.to_csv.