ACEseq - Allele-specific copy numbers from sequencing¶
License¶
The license of the ACEseq code is MIT <https://github.com/eilslabs/ACEseqWorkflow/blob/github/LICENSE.txt>. See here <https://github.com/eilslabs/ACEseqWorkflow/blob/github/package_LICENSES.md> licenses of packages used by the workflow.
Need help?¶
In case of question pleae contact Kortine Kleinheinz (k.kleinheinz@dkfz-heidelberg.de)
Installation & Run instructions¶
To run the ACEseq-workflow multiple components are needed:
- ACEseq workflow plugin
- The Roddy workflow management framework
- Software stack
- Reference data
- COWorkflowsBasePlugin
The The Standard Way to install the workflow is described below and involves the installation of each of these components. For the older 1.2.10 release we currently also provide prepackaged files and a Docker container. See Prepackaged files (ACEseq 1.2.10 only) below for instructions.
The Standard Way¶
The standard way to install the workflow is the manual installation of all components.
- Download the COWorkflowBasePlugin zip-archive from Github-Releases. The version to download can be found in the ACEseq buildinfo.txt.
- Download the ACEseq zip-archive from Github-Releases. The archive already contains a Jar-archive with the compiled Java/Groovy code (JAR-file) for the given Roddy API version. No compilation of the plugin is therefore required.
- The file ACEseq buildinfo.txt in also shows you the Roddy API version that you need for the chosen ACEseq workflow version.
- Install the required Roddy version. Please see the Roddy repository for installation instructions for Roddy.
- Install the software stack (see Software Stack (Conda) below) via Conda
- Install the reference files (see Reference files below) via the preparation script.
Software Stack (Conda)¶
The workflow contains a description of a Conda environment. A number of Conda packages from BioConda are required. You should set up the Conda environment at a centralized position available from all compute hosts.
First install the BioConda channels:
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
Then install the environment
conda env create -n ACEseqWorkflow -f $PATH_TO_PLUGIN_DIRECTORY/resources/analysisTools/copyNumberEstimationWorkflow/environments/conda.yml
The name of the Conda environment is arbitrary but needs to be consistent with the condaEnvironmentName variable. You can set the condaEnvironmentName variable in any of the loaded configuration files (see Roddy documentation) or even directly in your Roddy call via –cvalues=”condaEnvironmentName:$value”.
If you do not want to use Conda, you can get a complete list of all packages and package versions Conda would install from the $PATH_TO_PLUGIN_DIRECTORY/resources/analysisTools/copyNumberEstimationWorkflow/environments/conda.yml.
Reference files¶
The workflow uses various files as reference files, such as a reference genome or annotation files. Depending on the contents of these files also the outcome of your analysis may change. We provide installation scripts in the installation/ directory (currently only in the github branch of the repository). To download and prepare the reference files please check out the ACEseq repository and do
bash $PATH_TO_PLUGIN_DIRECTORY/installation/downloadRefrences $targetDirectory
with $targetDirectory being the directory into which you want to install the files. The variable baseDirectoryReference in your configurations needs to be set to the $targetDirectory path.
Note that the current plugin version is tuned to be run on the hg19 human assembly, but a liftover of all files should probably enable a run on GRch38.
Prepackaged files (ACEseq 1.2.10 only)¶
On http://bfg-nfs3.ipmb.uni-heidelberg.de you can find archives for the 1.2.10 plugin version. The prepackaged zip files contains a full Roddy / Plugin setup and include different scripts to install all necessary software and download the required reference files. Currently, we do not intent to update these prepackaged installation files or the Docker version. Note that the Roddy version packaged not capable of submitting to LSF.
Please see the standard way to install recent workflow versions.
Stand-alone Roddy for Execution on HTC Cluster¶
To run the Roddy-based version of ACEseq please download the pre-packed zip file from http://bfg-nfs3.ipmb.uni-heidelberg.de. Three steps are required to ensure running of ACEseq.
- Run the “prepareRoddyInstallation.sh” script.
- Download all reference files as specified in the section “Reference files” (below).
- Set up the Conda environment or install the necessary software as specified in the section “Software” (below).
Before running ACEseq a few parameters need to be adjusted in the configuration files. The output directory is specified in $PATH_TO_ACEseq_RODDY_VERSION/configurations/projectsACEseqTest.xml. Here the variables “baseDirectoryReference”, “inputBaseDirectory”, “outputBaseDirectory”, “outputAnalysisBaseDirectory” need to be set. If no SVs should be included the following configuration values (cvalues) should be included:
<cvalue name='runWithSv' value='true' type="boolean"/>
<cvalue name='SV' value='yes' type="boolean"/>
Otherwise “svOutputDirectory” and the SV bedpe filename in the filenames section need to be set.
<configurationvalues>
<cvalue name='svOutputDirectory' value='${outputAnalysisBaseDirectory}/nameOfDirectoryWithSVResults' type="path"/>
</configurationvalues>
<filenames package='de.dkfz.b080.co.files' filestagesbase='de.dkfz.b080.co.files.COFileStage'>
<filename class="TextFile" onMethod="de.dkfz.b080.co.aceseq.ACESeqMethods.mergeSv"
selectiontag="svFileTag"
pattern='${svOutputDirectory}/${pid}_svs.bedpe'/>
</filenames>
Technical specifications are set in the file $PATH_TO_ACEseq_RODDY_VERSION/configurations/applicationProperties.ini. The path to the project.xml and the path to the plugins ($PATH_TO_ACEseq_RODDY_VERSION/Roddy/dist/plugins/) need to be set under configurationDirectories and pluginDirectories. Finally the job manager and execution host need to be set.
Please have a look at the following default applicationProperties.ini file:
[COMMON]
useRoddyVersion=current # Use the most current version for tests
[DIRECTORIES]
configurationDirectories=[FOLDER_WITH_CONFIGURATION_FILES]
pluginDirectories=[FOLDER_WITH_PLUGINS]
[COMMANDS]
jobManagerClass=de.dkfz.roddy.execution.jobs.direct.synchronousexecution.DirectSynchronousExecutionJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.pbs.PBSJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.sge.SGEJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.slurm.SlurmJobManager
#jobManagerClass=de.dkfz.roddy.execution.jobs.cluster.lsf.rest.LSFRestJobManager
commandFactoryUpdateInterval=300
commandLogTruncate=80 # Truncate logged commands to this length. If <= 0, then no truncation.
[COMMANDLINE]
CLI.executionServiceUser=USERNAME
CLI.executionServiceClass=de.dkfz.roddy.execution.io.LocalExecutionService
#CLI.executionServiceClass=de.dkfz.roddy.execution.io.SSHExecutionService
CLI.executionServiceHost=[YOURHOST]
CLI.executionServiceAuth=keyfile
#CLI.executionServiceAuth=password
CLI.executionServicePasswd=
CLI.executionServiceStorePassword=false
CLI.executionServiceUseCompression=false
CLI.fileSystemInfoProviderClass=de.dkfz.roddy.execution.io.fs.FileSystemInfoProvider
To execute ACEseq run
sh $PATH_TO_ACEseq_RODDY_VERSION//Roddy/roddy.sh rerun ACEseq@copyNumberEstimation $pid \
--useconfig=$PATH_TO_ACEseq_RODDY_VERSION/configuration/applicationProperties.ini \
--cvalues="bamfile_list:$pathToControlBamFile;$pathToTumorBamFile,sample_list:control;tumor,possibleControlSampleNamePrefixes:control,possibleTumorSampleNamePrefixes:tumor"
More information on Roddy can be found here.
Docker version¶
- Download all reference files as specified in the section below.
- Download the Base and ACEseq Docker images from the website: http://bfg-nfs3.ipmb.uni-heidelberg.de
- Import both files with (names might differ based on supplied version):
docker load < BaseDockerContainer.tar.gz
docker load < ACEseqDockerContainer.tar.gz
- Download the control files archive and extract them. The directory contains the file “roddy.sh”. Please call this script with: bash roddy.sh. You will see:
#!/bin/bash
# 1: Run mode, which might be "run" or "testrun"
# 2: Configuration identifier, normally "ACEseq"
# 3: Configuration directory
# 4: Dataset identifier / PID
# 5: Control bam file
# 6: Tumor bam file
# 7: Control bam sample name
# 8: Tumor bam sample name
# 9: Reference files path
# 10: Output folder
# 11: Optional: The SV file
An example call is:
bash roddy.sh run ACEseq ./config/ stds /home/roddy/someproject/control_MB99_merged.mdup.bam /home/roddy/someproject/tumor_MB99_merged.mdup.bam control tumor /icgc/ngs_share/assemblies/hg19_GRCh37_1000genomes ./output
Here you tell roddy to run the ACEseq configuration using the config folder in the current directory with a control and tumor bam. Also you tell Roddy the samples for both files namely control and tumor. Finally, you supply the path to the reference files and the folder where you will store your output data.
QuickStart¶
To start ACEseq download package from here and install the reference files and conda package as described under Installation & Run instructions.
sh $PATH_TO_PLUGIN_DIRECTORY/Roddy/roddy.sh rerun ACEseq@copyNumberEstimation $pid \
--useconfig=$PATH_TO_PLUGIN_DIRECTORY/applicationProperties.ini \
--cvalues="bamfile_list:$pathToControlBamFile;$pathToTumorBamFile,sample_list:control;tumor,possibleControlSampleNamePrefixes:control,possibleTumorSampleNamePrefixes:tumor"
Following parameters should be changed in the project.xml:
- baseDirectoryReference
- outputBaseDirectory
- outputFileGroup (in case all outputfiles should have different group than primary group)
Alternative running modes:
- runWithoutControl (in case it should be run without control)
- runwithFakeControl (in case the coverage should be taken from a different control)
Requirements¶
Hardware¶
ACEseq requires the execution of multiple jobs that are highly parallelized in the beginning but linearize towards the end of the workflow. It requires a maximum of 50g RAM in few of the Jobs. On a HPC cluster with multiple cores available it will usually finish within 24h (100-160 CPU h). The final output usually requires between 4 and 6g memory.
Software¶
The installation of all required software can be found under Installation & Run instructions.
Alternative Running Modes¶
Multiple alternative running modes are enabled with ACEseq.
Run Without Control¶
If no control sample is available, but ACEseq was already used to process other tumor sample pairs one of their control coverage profile can be used for normalization. In this case the patient’s sex needs to be set with PATIENTSEX=”male|female|klinefelter”.
Please specify the path and prefix to a control coverage profile for a male (MALE_FAKE_CONTROL_PRE) and a female patient (FEMALE_FAKE_CONTROL_PRE) so it can be matched to the processed sample. To activate this option the value runWithout control needs to be set to ‘true’, either via the command line execution under cvalues or in the project.xml.
<cvalue name="runWithoutControl" value="true" type="boolean" />
<cvalue name="PATIENTSEX" value="male|female|klinefelter" type="boolean" />
<cvalue name='MALE_FAKE_CONTROL_PRE' value="pathToPID/${pid}/ACEseq/cnv_snp/${pid}.chr" type='path'
description="path and prefix to chromosome-wise 1kb coverage file used for fake control workflow for male patients" />
<cvalue name='FEMALE_FAKE_CONTROL_PRE' value="pathToPID/${pid}/ACEseq/cnv_snp/${pid}.chr" type='path'
description="path and prefix to chromosome-wise 1kb coverage file used for fake control workflow for female patients" />
Run quality check only¶
In case you do not want to run the full ACEseq pipeline immediately, but would rather access the sample’s quality first you can start ACEseq with the option “runQualityCheckOnly” set to “true”.
Replace low quality control¶
If a control sample is very noisy and masks CNAs it can be replaced with the coverage profile from a different control of the same sex. For this run ACEseq with “runWithFakeControl” set to “true” and specify the values “FEMALE_FAKE_CONTROL_PRE” and “MALE_FAKE_CONTROL_PRE” as described in the section for analysis without matched control.
Run with/without SV breakpoint incorporation¶
To process samples with incorporation of SV breakpoints set the following in the project.xml:
<configurationvalues>
<cvalue name='svOutputDirectory' value='${outputAnalysisBaseDirectory}/nameOfDirectoryWithSVResults' type="path"/>
<cvalue name='runWithSv' value='true' type="boolean"/>
</configurationvalues>
<filenames package='de.dkfz.b080.co.files' filestagesbase='de.dkfz.b080.co.files.COFileStage'>
<filename class="TextFile" onMethod="de.dkfz.b080.co.aceseq.ACESeqMethods.mergeSv"
selectiontag="svFileTag"
pattern='${svOutputDirectory}/${pid}_svs.bedpe'/>
</filenames>
If the bedpe file does not exist ACEseq will submit all steps until the bedpe file is required. A rerun once the SV file is generated will start the pipeline up from the point where SV breakpoints are incorporated.
To process a samples without SVs please set the following in the project.xml:
<cvalue name='runWithSv' value='false' type="boolean"/>
<cvalue name='SV' value='no' type="string"/>
Input Parameters¶
Multiple parameters can be set with ACEseq though not all are necessary to change. This table gives and overview and description for all available parameters
name | value | type | description |
---|---|---|---|
aceseqOutputDirectory | ${outputAnalysisBaseDirectory}/ACEseq_${tumorSample} | path | |
svOutputDirectory | ${outputAnalysisBaseDirectory}/SV_calls | path | |
crestOutputDirectory | ${outputAnalysisBaseDirectory}/crest | path | |
cnvSnpOutputDirectory | ${aceseqOutputDirectory}/cnv_snp | path | |
imputeOutputDirectory | ${aceseqOutputDirectory}/phasing | path | |
plotOutputDirectory | ${aceseqOutputDirectory}/plots | path | |
runWithoutControl | false | boolean | use control for analysis (false|true) |
minHT | 5 | integer | minimum number of consecutive SNPs to be considered for haploblocks |
snp_min_coverage | 5 | integer | minimum coverage in control for SNP |
cnv_min_coverage | 5000 | integer | minimum coverage for 1kb windows to be considered for merging in 10kb windows |
mapping_quality | 1000 | integer | minimum mapping quality for 1kb windows to be considered for merging in 10kb windows (maximum mappability) |
min_windows | 5 | integer | minimum number of 1kb windows fullfilling cnv_min_coverage and mapping_quality to obtain merged 10kb windows |
min_X_ratio | 0.8 | float | minimum ratio for number of reads on chrY per base over number of reads per base over whole genome to be considered as female |
min_Y_ratio | 0.12 | float | minimum ratio for number of reads on chrY per base over number of reads per base over whole genome to be considered as male |
LOWESS_F | 0.1 | float | f parameter for R lowess function |
SCALE_FACTOR | 0.9 | float | scale_factor for R lowess function |
COVERAGEPLOT_YLIMS | 4 | float | ylims for Rplots in GC-bias plots |
FILENAME_GC_CORRECT_PLOT | ${plotOutputDirectory}/${pid}_gc_corrected.png | path | gc-bias plot, before/during/after correction |
GC_bias_json_key | gc-bias | string | key in GC-bias json |
FILE_DENSITYBETA | ${aceseqOutputDirectory}/densityBeta.pdf | path | |
min_DDI_length | 1000 | integer | minimum length for DEL/DUP/INV to be considered for breakpoint integration |
selSVColumn | eventScore | string | column from bedpe file to be recored in ${pid}_sv_points.txt file |
min_seg_width | 2000 | integer | segmentByPairedPSCBS() minwidth parameter in PSCBS R package |
undo_SD | 24 | integer | segmentByPairedPSCBS() undo.SD parameter in PSCBS R package |
pscbs_prune_height | 0 | integer | pruneByHClust() parameter h in PSCBS R package |
min_segment_map | 0.6 | float | minimum average mappability over segment to be kept after segmentation |
min_seg_length_prune | 9000 | integer | maximum of segment to be considered for merging to neighbouring segment prior to clustering |
min_num_SNPs | 15 | integer | maximum number of SNPs in segment to be considered for merging to neighbouring segment prior to clustering |
clustering | yes | string | should segments be clustered (yes|no), coerage and BAF will be estimated and assigned clusterwide |
min_cluster_number | 1 | integer | minimum number of clusters to be tried with BIC |
min_membership | 0.8 | float | obsolete |
min_distance | 0.05 | float | min_distance |
haplogroupFilePrefix | haploblocks_chr | string | prefix for file with haplogroups per chromosome |
haplogroupFileSuffix | txt | string | suffix for file with haplogroups per chromosome |
haplogroupFilePath | ${imputeOutputDirectory}/${haplogroupFilePrefix} | path | |
min_length_purity | 1000000 | integer | minimum length of segments to be considered for tumor cell content and ploidy estimation |
min_hetSNPs_purity | 500 | integer | minimum number of control heterozygous SNPs in segments to be considered for tumor cell content and ploidy estimation |
dh_stop | max | string | |
min_length_dh_stop | 1000000 | integer | |
dh_zero | no | string | |
purity_min | 0.3 | float | minimum tumor cell content allowed |
purity_max | 1.0 | float | i |
ploidy_min | 1.0 | float | |
ploidy_max | 6.5 | float | |
SNP_VCF_CNV_PATH | ${cnvSnpOutputDirectory}/${pid}.chr | path | If the value is changed the value for the filename pattern MUST also be changed. |
SNP_VCF_CNV_PATH_STR | ${SNP_VCF_CNV_PATH} | string | This value must be converted to a string because of a bug. |
SNP_SUFFIX | snp.tab.gz | string | |
CHR_NR | ${CHR_PREFIX}${PARM_CHR_INDEX}${CHR_SUFFIX} | string | |
CHR_NAME | ${PARM_CHR_INDEX} | string | |
AUTOSOME_INDICES | ( {1..22} ) | bashArray | |
CREST | yes | string | include SV breakpoints in analysis (yes|no) |
mpileup_qual | 0 | integer | quality used for parameter ‘Q’ in samtools mpileup |
CNV_MPILEUP_OPTS | “-A -R -B -Q ${mpileup_qual} -q 1 -I “ | string | options for mpileup to determine which bases/reads to use |
FILE_VCF_SUF | vcf | string | suffix for vcf files |
FILE_TXT_SUF | txt | string | suffix for txt files |
phasedGenotypesFilePrefix | phased_chr | string | prefix for phased genotypes file |
unphasedGenotypesFilePrefix | unphased_chr | string | prefix for unphased genotypes file |
phasedGenotypesFileSuffix | ${FILE_VCF_SUF} | string | suffix for phased genotypes file |
unphasedGenotypesFileSuffix | ${FILE_VCF_SUF} | string | suffix for unphased genotypes file |
BCFTOOLS_OPTS | “-vgN “ | string | bcftools options for imputation |
FAKE_CONTROL_POST | .cnv.anno.tab.gz | string | suffix for chromosome wise 1kb coverage files used for fake control workflow |
PATIENTSEX | male | string | patient sex used in case of no control workflow (male|female|klinefelter) |
CNV_ANNO_SUFFIX | cnv.anno.tab.gz | string | suffix for mappability annotated chromosome-wise 1kb coverage files |
CNV_SUFFIX | cnv.tab.gz | string | suffix chromosome-wise 1kb coverage files |
FILE_UNPHASED_PRE | ${imputeOutputDirectory}/${unphasedGenotypesFilePrefix} | path | |
FILE_UNPHASED_GENOTYPE | ${imputeOutputDirectory}/unphased_genotype_chr | path | |
FILE_PHASED_PRE | ${imputeOutputDirectory}/${phasedGenotypesFilePrefix} | path | |
FILE_PHASED_GENOTYPE | ${imputeOutputDirectory}/phased_genotype_chr | path | |
FILE_INFO | info | string | |
FILE_INFO_SAMPLE | info_by_sample | string | |
FILE_HAPS | haps | string | |
FILE_HAPS_CONF | haps_confidence | string | |
FILE_SUMMARY | summary | string | |
FILE_WARNINGS | warnings | string | |
FILE_PART | part | string | |
FILE_SAMPLE_G | ${imputeOutputDirectory}/sample_g.txt | path | sample_g file used by imputation on X chromosome for females |
MALE_FAKE_CONTROL_PRE | ${pathToACEseqResults}/cnv_snp/${pid}.chr | path | path and prefix to chromosome-wise 1kb coverage file used for fake control workflow for male patients |
FEMALE_FAKE_CONTROL_PRE | ${pathToACEseqResults}/cnv_snp/${pid}.chr | path | path and prefix to chromosome-wise 1kb coverage file used for fake control workflow for female patients |
PLOT_PRE | ${aceseqOutputDirectory}/${pid}_plot | path | |
FILE_MOST_IMPORTANT_INFO_SEG_PRE | ${pid}_most_important_info | string | |
FILE_MOST_IMPORTANT_INFO_SEG_POST | .txt | string | |
FILE_SEGMENT_VCF_PRE | ${aceseqOutputDirectory}/${pid} | path | |
FILE_SEGMENT_VCF_POST | .cnv.vcf | string | |
outputUMask | 007 | string | |
outputFileGroup | $accessGroup | group for output files and directories | |
outputAccessRights | u+rw,g+rw,o-rwx | access rights for written files | |
outputAccessRightsForDirectories | u+rwx,g+rwx,o-rwx | access rights for written directories | |
possibleControlSampleNamePrefixes | ( blood) | bashArray | possible prefix of control bam if named ${prefix}_${pid}_$mergedBamSuffix |
possibleTumorSampleNamePrefixes | ( tumor ) | bashArray | same as possibleControlSampleNamePrefixes |
referenceGenome_1KGRef | ${path}/hs37d5.fa | path | reference genome file |
REFERENCE_GENOME | ${referenceGenome_1KGRef} | string | |
dbSNP_FILE | ${path}/00-All.SNV.vcf.gz | path | |
MAPPABILITY_FILE | ${path}/wgEncodeCrgMapabilityAlign100mer_chr.bedGraph.gz | path | mappability file |
CHROMOSOME_LENGTH_FILE | ${path}/chrlengths.txt | path | |
REPLICATION_TIME_FILE | ${path}/ReplicationTime_10cellines_mean_10KB.Rda | path | replication timing file |
GC_CONTENT_FILE | ${path}/hg19_GRch37_100genomes_gc_content_10kb.txt | path | |
GENETIC_MAP_FILE | ${path}/genetic_map_chr${CHR_NAME}_combined_b37.txt | path | impute files |
KNOWN_HAPLOTYPES_FILE | ${path}/ALL.chr${CHR_NAME}.integrated_phase1_v3. 20101123.snps_indels_svs.genotypes.nomono.haplotypes.gz | path | impute files |
KNOWN_HAPLOTYPES_LEGEND_FILE | ${path}ALL.chr${CHR_NAME}.integrated_phase1_v3. 20101123.snps_indels_svs.genotypes.nomono.legend.gz | path | impute files |
GENETIC_MAP_FILE_X | ${path}/genetic_map_chrX_nonPAR_combined_b37.txt | path | impute files |
KNOWN_HAPLOTYPES_FILE_X | ${path}/ALL_1000G_phase1integrated_v3_chrX_nonPAR_impute.hap.gz | path | impute files |
KNOWN_HAPLOTYPES_LEGEND_FILE_X | ${path}/ALL_1000G_phase1integrated_v3_chrX_nonPAR_impute.legend.gz | path | impute files |
outputExecutionDirectory | ${path}/exec_${executionTimeString} | path to log files | |
imputeBaseDirectory | ${path}/ | path | directory for impute files |
mergedBamSuffix | merged.mdup.bam | string | A list of all known suffixes for merged bam files. I.e. merged.dupmark.bam, merged.mdup.bam… |
mergedBamSuffixList | ${mergedBamSuffix} | string | A list of all known suffixes for merged bam files. I.e. merged.dupmark.bam, merged.mdup.bam… |
defaultMergedBamSuffix | ${mergedBamSuffix} | string | The default suffix for merged bam files when they are created by Roddy. |
libloc_PSCBS | string | path to PSCBS library in R | |
libloc_flexclust | string | path to felxclust library in R |
Purity Evaluation¶
Depending on the sample, ACEseq might return multiple possible solutions. These solutions can be found in the file ${PatientID}_ploidy_purity_2D.txt within the ACEseq results directory.
In addition to the table the plot ${PatientID}_tcn_distance_start.png shows a distance matrix. This distance matrix indicates the estimated mean distance per ploidy/tcc combination based on the equations explained in the methods section. The optimally found minima are indicated by a star.

In case of multiple solutions, one can choose between the solution with the lowest distance or do as we suggest and choose the solution that is closest to diplot. It is recommended to make use of the prior knowledge about tumor biology as well as checking the final output as well as the mutant-allele frequency of a patient.

As can be seen below the benefit of increasing the ploidy of this sample to tetraploid leads to a clonal fit of multiple segments though many others remain subclonal (indicated by deviation from integer copy number states). This is often observed for heterogenous samples such as this shown lymphoma sample. Lymphoma tend to be diploid and heterogenous indicating that the diploid solution is correct. In addition we plotted the MAF distribution over balanced segments, that supports our assumption. Diploid solution:

Tetraploid solution:

MAF distribution over balanced segments:
Quality check¶
ACEseq provides a thorough quality check of the sample by investigation of the GC-bias: 1.differences in GC-bias between tumor and control 2.evenness of coverage in tumor and control
The following plot shows the raw GC bias for a healthy control (left), a corresponding tumor (middle) and the tumor/control ratio (right). The top row depicts raw data while the middle row indicates GC-bias corrected data and the bottom line indicates GC-bias and RT-bias corrected data.

The file ${pid}_qc_gc_corrected.json provides information about slope, curvature and their differences between tumor and control. A strong diffrence between tumor and control can impact sensitivity and specificty of other variant calls.
The full width half maximum (FWHM) indicates the noise level within the majority copy number state in a sample. If it exceeds 0.205 in the control or 0.34 in the tumor a sample should be flagged with a warning. Yellow flagged tumors might have decreased specificity and sensitiviy. Samples should be red flagged in case the FWHM exceed 0.29 in healthy controls of 0.54 in tumors. Red flagged tumor samples are very likely to accumulate artifacts due to a noisy coverage profile and should be excluded from further analysis Flagged controls can be rescued by rerunning the pipeline with “rerunWithFakeControl=true”. The following plot shows a sample with low FWHM (A and C) and a sample with noisy coverage and thus high FWHM (B and D). No good copy number estimates can be obtained from the high FWHM sample.

Final Output¶
A graphical presentation of the final results is given for each tcc/ploidy solution. A general overview is give for the whole genome as shown here and per chromosome. Points represent raw SNP data points, colored by their copy number with regard to the majority copy number state (red:loss, black: neutral, green: gain). Segments are indicated by dark and light blue lines for total and allele-specific copy number respectively. Raw BAF are shown in the bottom panel which can be used to evaluate tcc and confirm allele-specific copy numbers.

Final results are provided in two formats. A reduced set of information is contained in the file ${pid}_most_important_info_${ploidy}_${purity}.txt while the full set is presented in ${pid}_comb_pro_extra_${ploidy}_${purity}.txt. A mapping of both headers and a corresponding description is given here.
most_important_info | comb_pro_extra | description |
---|---|---|
chromosome | chromosome | |
start | start | start coordinate |
end | end | end coordinate |
SV.Type | crest | SV type connected to both or one breakpoint |
length | length | length of segment |
TCN | tcnMean | total copy number |
NbrOfHetSNPs | tcnNbrOfHets | number of control heterozygous SNPs in segment |
dhSNPs | dhMax | estimated DH |
minStart | minStart | last SNP prior to segment start |
maxStart | maxStart | first SNP after segment start |
minEnd | minStop | last SNP prior to segment end |
maxEnd | maxStop | first SNP after segment end |
covRatio | tcnMeanRaw | bias corrected coverage ratio |
dhEst | dhMean | raw DH |
c1Mean | c1Mean | minor allele copy number |
c2Mean | c2Mean | major allele copy number |
genotype | genotype | ratio of rounded allele copy numbers |
CNA.type | CNA.type | DUP/DEL/LOH/TCNneutral |
GNL | GNL | gain/loss/loh compared to diploid |
tcnNbrOfSNPs | number of SNPs per segment | |
tcnNbrOfLoci | number of SNPs per segment | |
dhNbrOfLoci | heterozygous SNPs per segment | |
map | mappable/unmappable | |
cluster | cluster assigned during merging | |
neighbour | indicates whether neighbouring segments exist on both sides | |
distDH | distance to main cluster center DH | |
errorSNP | error for DH direction | |
distTcn | distance to main cluster center coverage ratio | |
errorLength | error in coverage ratio direction | |
totalError | sum of errorSNP and errorLength | |
area | AUC ratio | |
peaks | 1 for balanced; 2 for imbalanced | |
meanCovT | average total coverage per cluster | |
meanCovB | average total coverage of B allele | |
AF | allelic factor | |
BAF | B-allele frequency | |
A | rounded minor allele copy number | |
B | rounded major allele copy number | |
TCN | rounded copy number | |
ploidy | majority copy number used as reference for CNA.type | |
Sex | patient sex | |
cytoband | cytoband |
HRD, LST, TAI¶
HRD, LST and TAI scores are provided in the file ${pid}_HRDscore_${ploidy}_$tcc} for each solution.
HRD and LST are merged based on smoothed segments as suggested by Popova et al. (doi: 10.1158/0008-5472.CAN-12-1470).
TAI are based on unmerged segments. A combination of the three scores might be useful.
Methods - Theory¶
Pre-processing¶
GC-/Replication timing bias correction¶
Correction for GC bias¶
As described in detail by Benjamini and Speed (REF) genomic regions with varying GC content may be sequenced at different depth due to selection bias or sequencing efficiency. Differing raw read counts in these regions even in the absence of copy number alterations can could lead to false positive calls. A GC-bias plot (Figure XY) can be used to visually inspect the bias of a sample. ACEseq first fits a curve to the data using LOWESS (locally weighted scatterplot smoothing, implemented in R) to identify the main copy number state first, which will be used to for a second fit. The second fit to the main copy number state is used for parameter assessment and correction of the data. This two-step fitting is necessary to compensate for large copy number changes that could lead to a misfit. The LOWESS fit as described above interpolates over all 10 kb windows. It thus averages over all different copy number states. If two states have their respective center of mass at different GC content, this first LOWESS fit might be distorted and not well suited for the correction. The full width half maximum (FWHM) of the density over all windows of the main copy number state is estimated for control and tumor. An usual large value here indicates quality issues with the sample.
Correction for replication time¶
The two bias correction steps described above are done sequentially. A simultaneous 2D LOWESS or LOESS correction would be desirable, but fails due to computational load (the clusters to be fitted have 106 points). Different parameters such as slope and curvature of the both LOWESS correction curves used are extracted. The GC curve parameters is used as quality measures to determine the suitability of the sample for further analysis whereas the replication timing curve parameters is used to infer the proliferation activity of the tumor. We could show a strong correlation between Ki-67 estimates and the slope of the fitted curve (Figure).
Segmentation¶
Segment reliability¶
Segment clustering and merging¶
Allelic adjustment¶
Calling of Allelic Balance and Imbalance¶



Copy Number Estimation¶





Purity and ploidy estimation¶
Final output¶
Once the optimal ploidy and tumor cell content combinations are found the TCN and allele-specific CN will be estimated for all segments in the genome and classified (gain, loss, copy-neutral LOH, loss LOH, gain LOH, sub). If a segments TCN is further than 0.3 away from an integer value it is assumed to originate from subpopulations in the tumor sample that lead to gains or losses in part of the tumor cell population.