Abstract Body

Background: The Illumina MiSeq DNA sequencing system generates several gigabases of short reads per run with a relatively low error rate. We previously described longitudinal (run-to-run) contamination on this platform, which has since been addressed by a post-run bleach wash. Here we characterize the rates and sources of systematic, low-level, within-run cross-sample contamination, an under-reported issue for this platform.

Methods: To assess cross-contamination observed in previous experiments, two libraries of disparate amplicons (HCV NS5B, human HLA-B) were sequenced at high read depth on a single MiSeq run (v2, 2 x 250 bp). HCV RNA was extracted from 24 patient-derived plasma samples using a NucliSens easyMag, and a 327-bp fragment of NS5B was amplified by nested RT-PCR. Human genomic DNA was extracted from 33 whole blood samples, and a region spanning HLA-B exons 2 and 3 was amplified. All stages of HCV and HLA library preparation were performed on different days by different staff. Indexed PCR primers for these targets were ordered months apart, effectively ruling out primer synthesis as a source of cross-contamination. Including replicates, 69 amplicons were sequenced using a total of 56 Illumina index pairs. Each sequenced HCV or HLA sample shared zero, one or two indices with samples of the opposite type. Short read data were cleaned and iteratively mapped using a custom pipeline built around bowtie2 and samtools.
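
The index-sharing classification described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the sample names and index labels are hypothetical placeholders, and `shared_index_count` is a helper name introduced here.

```python
from itertools import product

def shared_index_count(a, b):
    """Count the index positions (first, second) at which two samples match."""
    return sum(x == y for x, y in zip(a, b))

# Hypothetical sample sheet: sample name -> (index 1, index 2).
# These labels are placeholders, not the indices used in the study.
hcv = {"HCV-01": ("N701", "S501"), "HCV-02": ("N702", "S502")}
hla = {
    "HLA-01": ("N701", "S503"),
    "HLA-02": ("N703", "S504"),
    "HLA-03": ("N702", "S502"),
}

# Classify every HCV/HLA pair by the number of indices they share (0, 1 or 2).
pairs = {
    (h, l): shared_index_count(hcv[h], hla[l])
    for h, l in product(hcv, hla)
}

for (h, l), n in sorted(pairs.items()):
    print(f"{h} vs {l}: {n} shared")
```

Comparing indices position by position means that an index 1 label is never matched against an index 2 label, which is the comparison that matters for a dual-indexing design.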

Results: The run cluster density was 940 K/mm² with 89% of reads passing filters, suggesting normal instrument performance and library preparation. On average, approximately 141,000-fold and 177,000-fold coverage was obtained for HCV and HLA-B, respectively. Up to 3637 HLA-B reads (1.8% of total reads) were observed in samples expected to contain only HCV, and up to 217 HCV reads (0.09%) were observed in HLA-B samples. Screening all suspected contaminants (e.g. HCV reads in an HLA sample) against all consensus sequences indicated that the source of contamination was far more likely to be a sample that shared one Illumina index than a sample that shared none (OR=15.7, p=10⁻¹¹). Cross-contamination between samples sharing one index was also observed between samples of the same type.
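
An odds ratio of this kind comes from a 2x2 table of candidate source samples (shares one index or none, by identified as the contamination source or not). The sketch below shows the arithmetic only. The abstract does not state which test produced the reported p-value, so the one-sided Fisher's exact test here is an assumption, and the counts are hypothetical placeholders rather than study data.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, row1)

# Hypothetical counts (NOT the study data):
#                     is source   not source
# shares one index        a=30        b=10
# shares no index         c=10        d=30
a, b, c, d = 30, 10, 10, 30
print("OR =", odds_ratio(a, b, c, d))
print("one-sided p =", fisher_one_sided(a, b, c, d))
```

The exact test is used in this sketch because it needs only the standard library and makes no large-sample approximation.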

Conclusions: The MiSeq is subject to low-level cross-contamination from samples that share one “barcode” in a dual-indexing strategy. Accurate interpretation of low-frequency variants detected by deep sequencing requires knowledge of all other samples run on the instrument and their associated barcodes.