Chapter 1 A guide for running the analysis scripts

The QC pipeline is organized into two main chapters: sequence-barcode association, and RNA and DNA quantification. Each chapter is accompanied by a dedicated script that allows the user to run all analyses presented in this book.

Below, we describe the files required to run the pipeline. Alongside each analysis, you will find the input files needed to run it. Example file formats are provided below.

1.1 Associations

1.1.1 Script

association_analysis.py

1.1.2 Input

final_associations: BC - cCRE association file after all filtering.

barcode	cCRE	match_count
[str]	[str]	[int64]
BC sequence 1	cCRE ID 2	Number of observations (reads) for this BC-cCRE association
BC sequence 2	cCRE ID 2	Number of observations (reads) for this BC-cCRE association
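Before running the script, it can help to confirm that an association file matches the format above. The following is a minimal sketch, not part of the pipeline: it assumes the file is tab-separated with a header row, and the function name read_associations and the example barcodes are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["barcode", "cCRE", "match_count"]

def read_associations(handle):
    """Read a tab-separated association table into a list of row dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        rows.append({
            "barcode": row["barcode"],
            "cCRE": row["cCRE"],
            # match_count must parse as an integer; int() raises otherwise
            "match_count": int(row["match_count"]),
        })
    return rows

# Hypothetical example content; a real file holds one row per BC-cCRE pair.
example = io.StringIO(
    "barcode\tcCRE\tmatch_count\n"
    "ACGTACGTACGTACG\tcCRE_1\t12\n"
    "TTGCATTGCATTGCA\tcCRE_2\t3\n"
)
associations = read_associations(example)
```

To check a file on disk, pass an open file handle instead of the in-memory example.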

associations_before_minimum_observations: BC - cCRE association file before filtering for a minimum number of unique BC-cCRE observations. Format is identical to final_associations.

associations_before_promiscuity: BC - cCRE association file before filtering out BCs that are associated with multiple cCREs. Format is identical to final_associations.
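To illustrate what the two filtering steps mean, here is a small sketch operating on rows in the final_associations format. It is an illustration only and not the pipeline's implementation: the function names, the threshold value, and the order in which the two filters are applied are all assumptions.

```python
from collections import defaultdict

def drop_low_count(rows, min_observations):
    """Keep BC-cCRE pairs observed at least `min_observations` times."""
    return [r for r in rows if r["match_count"] >= min_observations]

def drop_promiscuous(rows):
    """Remove every barcode that is associated with more than one cCRE."""
    ccres_per_barcode = defaultdict(set)
    for r in rows:
        ccres_per_barcode[r["barcode"]].add(r["cCRE"])
    return [r for r in rows if len(ccres_per_barcode[r["barcode"]]) == 1]

# Hypothetical rows: BC2 is rarely observed, BC3 maps to two cCREs.
rows = [
    {"barcode": "BC1", "cCRE": "cCRE_1", "match_count": 10},
    {"barcode": "BC2", "cCRE": "cCRE_1", "match_count": 1},
    {"barcode": "BC3", "cCRE": "cCRE_1", "match_count": 7},
    {"barcode": "BC3", "cCRE": "cCRE_2", "match_count": 6},
]
after_min = drop_low_count(rows, min_observations=2)  # threshold is hypothetical
final = drop_promiscuous(after_min)                   # filter order is assumed
```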

associations_downsampling_path: A path to a folder containing input files for the downsampling analysis. The format of these files is identical to final_associations.

cCRE_fasta: FASTA file that includes all cCREs tested.
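A useful consistency check is that every cCRE named in an association file also appears in cCRE_fasta. The sketch below assumes that the FASTA record ID (the header text after ">" up to the first whitespace) is the same identifier used in the cCRE column. This guide does not state that, and the names used here are hypothetical.

```python
import io

def read_fasta_ids(handle):
    """Return the set of record IDs found in FASTA header lines."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                # ID = header text up to the first whitespace
                ids.add(header.split()[0])
    return ids

# Hypothetical FASTA content and cCRE IDs taken from an association file.
fasta = io.StringIO(">cCRE_1\nACGT\n>cCRE_2 some description\nGGCC\n")
fasta_ids = read_fasta_ids(fasta)
association_ccres = {"cCRE_1", "cCRE_3"}

# cCREs referenced by the associations but absent from the FASTA file
missing = sorted(association_ccres - fasta_ids)
```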

1.1.3 Output

Output figures for the cCRE-BC association analyses, generated by the pipeline, are presented in Chapter 2, Associations QC.

1.2 RNA and DNA quantification

1.2.1 Script

activity_analysis.py

1.2.2 Input

activity_df: Each row represents a cCRE and includes its activity data: the activity statistic, its p-value and FDR-adjusted p-value, and the RNA and DNA counts. This is the key file for the activity chapter.

cCRE	DNA_rep_comb	RNA_rep_comb	activity_status	RNA_DNA_ratio_log_rep_comb	activity_pval	activity_statistic	activity_FDR
[str]	[float64]	[float64]	[str], Allowed values: ‘non_active’/‘active’	[float64]	[float64]	[float64]	[float64]
cCRE ID 1	DNA read count across all replicates	RNA read count across all replicates	cCRE activity	log RNA/DNA ratio across all replicates	Statistic score p-value	activity statistic	adjusted p-value after FDR
cCRE ID 2	DNA read count across all replicates	RNA read count across all replicates	cCRE activity	log RNA/DNA ratio across all replicates	Statistic score p-value	activity statistic	adjusted p-value after FDR
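The allowed values and numeric ranges in activity_df can be checked row by row before plotting. This is a minimal sketch under stated assumptions: rows are given as dictionaries keyed by the column names above, the helper name check_activity_row is hypothetical, and the [0, 1] range check simply reflects that p-values and FDR-adjusted p-values are probabilities.

```python
ALLOWED_STATUS = {"active", "non_active"}

def check_activity_row(row):
    """Return a list of problems found in one activity_df row (empty if none)."""
    problems = []
    if row["activity_status"] not in ALLOWED_STATUS:
        problems.append(f"unexpected activity_status: {row['activity_status']!r}")
    for column in ("activity_pval", "activity_FDR"):
        value = float(row[column])
        if not 0.0 <= value <= 1.0:
            problems.append(f"{column} outside [0, 1]: {value}")
    return problems

# Hypothetical rows: the second has a misspelled status and an invalid p-value.
good = {"activity_status": "active", "activity_pval": "0.01", "activity_FDR": "0.04"}
bad = {"activity_status": "Active", "activity_pval": "1.5", "activity_FDR": "0.2"}

good_problems = check_activity_row(good)
bad_problems = check_activity_row(bad)
```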

activity_per_rep: RNA and DNA read count data for each cCRE, per replicate and combined.

cCRE	RNA_rep1	DNA_rep1	RNA_rep2	DNA_rep2	RNA_rep3	DNA_rep3	RNA_DNA_ratio_log_rep1	RNA_DNA_ratio_log_rep2	RNA_DNA_ratio_log_rep3
[str]	[object], a list of integers	[object], a list of integers	[object], a list of integers	[object], a list of integers	[object], a list of integers	[object], a list of integers	[float64]	[float64]	[float64]
cCRE ID 1	RNA reads rep1	DNA reads rep1	RNA reads rep2	DNA reads rep2	RNA reads rep3	DNA reads rep3	RNA/DNA log ratio rep 1	RNA/DNA log ratio rep 2	RNA/DNA log ratio rep 3
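The per-replicate count columns hold a list of integers in each cell. When such a table is read from a text file, each list typically arrives as a string and has to be parsed. The sketch below assumes the cells are written as Python-style list literals such as "[12, 7, 30]"; the on-disk representation is not specified in this guide, and parse_int_list is a hypothetical helper.

```python
import ast

def parse_int_list(cell):
    """Convert a cell such as "[12, 7, 30]" into a list of ints."""
    values = ast.literal_eval(cell)  # safely evaluates literals only
    if not isinstance(values, list) or not all(isinstance(v, int) for v in values):
        raise ValueError(f"expected a list of integers, got: {cell!r}")
    return values

# Hypothetical RNA_rep1 cell for one cCRE
rna_rep1 = parse_int_list("[12, 7, 30]")
total_rna_rep1 = sum(rna_rep1)
```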

cCRE_fasta: FASTA file that includes all cCREs tested.

different_std_threshold_analysis: DNA and RNA counts after outlier filtering at several levels of strictness.

ratio_log_{outlier_filter}_{rep}	DNA_{outlier_filter}sum{rep}
[float64]	[float64]
RNA/DNA ratio for each outlier filter parameter - replicate pair	DNA read count for each outlier filter parameter - replicate pair
RNA/DNA ratio for each outlier filter parameter - replicate pair	DNA read count for each outlier filter parameter - replicate pair

screen_df: Overlap of the cCRE library with the ENCODE SCREEN database of regulatory elements. Each row represents a cCRE and must carry one of the following SCREEN annotations: Distal enhancer like sequence, DNase-only, Proximal enhancer like sequence, Heterochromatin, Promoter like sequence, DNase-H3K4me3. This file can be created using bedtools.

activity_status	activity_statistic	class
[str], Allowed values: ‘non_active’/‘active’	[float64]	[str], Allowed values: ‘Proximal Enhancer’/‘Distal Enhancer’/‘Promoter’/‘Heterochromatin’/‘DNase-only’/‘DNase-H3K4me3’
cCRE activity	cCRE activity statistic	cCRE screen class overlap
cCRE activity	cCRE activity statistic	cCRE screen class overlap

tss_df: Distance of each cCRE from the nearest TSS. Each row must include a numeric value that represents the distance. This file can be created using bedtools.

activity_status	activity_statistic	log10_distance
[str], Allowed values: ‘non_active’/‘active’	[float64]	[float64]
cCRE activity	cCRE activity statistic	cCRE distance from nearest TSS, log10
cCRE activity	cCRE activity statistic	cCRE distance from nearest TSS, log10

AI_df: Comparison of MPRA activity data with an AI model's predictions for the same cCREs.

cCRE	exp: MPRA_activity	AI: predicted_activity
[str]	[float64]	[float64]
cCRE ID 1	Experimental activity statistic	AI-predicted activity statistic
cCRE ID 2	Experimental activity statistic	AI-predicted activity statistic

AI_comparative_df: Same as AI_df, but for differential activity.

id	LFC - exp	LFC - AI
[str]	[float64]	[float64]
cCRE ID 1	Experimental log fold change derived/ancestral	AI-predicted log fold change derived/ancestral
cCRE ID 2	Experimental log fold change derived/ancestral	AI-predicted log fold change derived/ancestral

downsampling_activity_path: A path to a folder that includes an activity_df file for each sampling parameter.

downsampling_ratio_path: A path to a folder that includes an activity_per_rep file for each sampling parameter.
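Both downsampling inputs are folders holding one file per sampling parameter. The following sketch shows one way to enumerate such a folder in a stable order. The file extension and the naming scheme (activity_df_<fraction>.tsv) are hypothetical, since this guide does not specify how the files are named.

```python
from pathlib import Path
import tempfile

def list_downsampling_files(folder, pattern="*.tsv"):
    """Return the files in `folder` matching `pattern`, sorted by name."""
    return sorted(Path(folder).glob(pattern))

# Build a throwaway folder with two hypothetical downsampling files.
with tempfile.TemporaryDirectory() as tmp:
    for fraction in ("0.25", "0.50"):
        (Path(tmp) / f"activity_df_{fraction}.tsv").write_text("cCRE\n")
    found = [p.name for p in list_downsampling_files(tmp)]
```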

comparative_df: MPRA comparative results. Each row represents a locus.

seq_id	logFC	differentialy_active	differential_activity_FDR
[str]	[float64]	[bool]	[float64]
cCRE ID 1	logFC between the derived and ancestral alleles	differential activity status	p-value after FDR
cCRE ID 2	logFC between the derived and ancestral alleles	differential activity status	p-value after FDR

allelic_pairs_df: MPRA quantitative data. Each row represents a locus and includes data for both alleles of the locus.

cell_types_df: MPRA quantitative data. Each row represents a cCRE and includes data for two different cell types.

seq_id	RNA_DNA_ratio_log_cell1	RNA_DNA_ratio_log_cell2
[str]	[float64]	[float64]
cCRE ID 1	RNA/DNA log ratio in cell type 1	RNA/DNA log ratio in cell type 2
cCRE ID 2	RNA/DNA log ratio in cell type 1	RNA/DNA log ratio in cell type 2

allelic_pairs_replicates_df: log2 RNA/DNA data for each locus, including both alleles and their logFC.

seq_id	lfc_rep1	lfc_rep2
[str]	[float64]	[float64]
cCRE ID 1	Log fold change derived/ancestral rep1	Log fold change derived/ancestral rep2
cCRE ID 2	Log fold change derived/ancestral rep1	Log fold change derived/ancestral rep2

control_df: Control annotation for each cCRE.

cCRE	cCRE type
[str]	[str], Allowed values: ‘positive’/‘negative’/‘test’
cCRE ID 1	cCRE annotation
cCRE ID 2	cCRE annotation

reads_by_group: RNA reads for each cCRE by sample

cCRE	{Sample}
[str]	[str]
cCRE ID 1	cCRE RNA reads in {sample}
cCRE ID 2	cCRE RNA reads in {sample}

samples_metadata: Group annotation per sample

Sample	Group
[str]	[str]
Sample ID 1	Group annotation for sample 1
Sample ID 2	Group annotation for sample 2
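reads_by_group and samples_metadata are linked through the sample ID: each sample column in reads_by_group corresponds to a Sample row in samples_metadata. The sketch below shows one way to combine them by summing reads per cCRE and group. Whether the pipeline aggregates this way is an assumption, and the in-memory layout and the function name reads_per_group are hypothetical.

```python
from collections import defaultdict

def reads_per_group(reads_by_group, sample_to_group):
    """Sum RNA reads for every (cCRE, group) pair."""
    totals = defaultdict(int)
    for ccre, per_sample in reads_by_group.items():
        for sample, reads in per_sample.items():
            if sample not in sample_to_group:
                raise KeyError(f"sample {sample!r} has no group annotation")
            # int() also accepts counts that were read from a file as strings
            totals[(ccre, sample_to_group[sample])] += int(reads)
    return dict(totals)

# Hypothetical data: three samples, two groups.
reads_by_group = {
    "cCRE_1": {"S1": 5, "S2": 7, "S3": 2},
    "cCRE_2": {"S1": 0, "S2": 1, "S3": 9},
}
sample_to_group = {"S1": "A", "S2": "A", "S3": "B"}

totals = reads_per_group(reads_by_group, sample_to_group)
```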

1.2.3 Output

Output figures for the RNA and DNA quantification analyses, generated by the pipeline, are presented in Chapter 3, Activity QC.