Chapter 1 A guide for running the analysis scripts

The QC pipeline is organized into two main chapters: sequence-barcode association, and RNA and DNA quantification. Each chapter is accompanied by a dedicated script that allows the user to run all analyses presented in this book.

Below, we describe the files required to run the pipeline. Each analysis lists the input files it needs, together with an example of each file's format.

1.1 Associations

1.1.1 Script

association_analysis.py

1.1.2 Input

final_associations: BC - cCRE association file after all filtering.

| barcode | cCRE | match_count |
|---|---|---|
| [str] | [str] | [int64] |
| BC sequence 1 | cCRE ID 2 | Number of observations (reads) for this BC-cCRE association |
| BC sequence 2 | cCRE ID 2 | Number of observations (reads) for this BC-cCRE association |
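
Before running association_analysis.py, it can help to confirm that an association file has the three expected columns and that match_count holds integers. The following is a minimal sketch, assuming the file is tab-separated with a header row (the delimiter is not specified in this guide). `check_associations` is a hypothetical helper and is not part of the pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ("barcode", "cCRE", "match_count")


def check_associations(handle, delimiter="\t"):
    """Parse an association file and return (barcode, cCRE, match_count) tuples.

    Hypothetical helper for illustration; the tab delimiter is an assumption.
    Raises ValueError if a required column is missing or a count is not an integer.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            count = int(row["match_count"])
        except (TypeError, ValueError):
            raise ValueError(f"line {line_no}: match_count is not an integer") from None
        rows.append((row["barcode"], row["cCRE"], count))
    return rows


# Toy data for illustration only.
example = io.StringIO(
    "barcode\tcCRE\tmatch_count\n"
    "ACGT\tcCRE_1\t12\n"
    "TTGA\tcCRE_2\t3\n"
)
rows = check_associations(example)
```

The same check applies to the other association inputs, since they share this format.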

associations_before_minimum_observations: BC - cCRE association file before filtering for a minimum number of unique BC-cCRE observations. Format is identical to final_associations.

associations_before_promiscuity: BC - cCRE association file before filtering out BCs that are associated with multiple cCREs. Format is identical to final_associations
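
To make the promiscuity filter concrete, the sketch below drops every barcode that is associated with more than one cCRE. It illustrates the idea only: the pipeline's actual rule is not described here (it might, for example, keep a dominant association), and `drop_promiscuous` is a hypothetical name.

```python
from collections import defaultdict


def drop_promiscuous(rows):
    """Keep only rows whose barcode maps to exactly one cCRE.

    rows: iterable of (barcode, cCRE, match_count) tuples.
    Hypothetical helper; a strict filter chosen for illustration.
    """
    rows = list(rows)
    ccres_per_barcode = defaultdict(set)
    for barcode, ccre, _count in rows:
        ccres_per_barcode[barcode].add(ccre)
    return [row for row in rows if len(ccres_per_barcode[row[0]]) == 1]


# Toy data: barcode ACGT is seen with two different cCREs.
before = [("ACGT", "cCRE_1", 12), ("ACGT", "cCRE_2", 4), ("TTGA", "cCRE_2", 3)]
after = drop_promiscuous(before)
```

Here only the TTGA row is retained, because ACGT maps to both cCRE_1 and cCRE_2.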

associations_downsampling_path: A path to a folder containing input files for the downsampling analysis. Format of these files is identical to final_associations

cCRE_fasta: Fasta file that includes all cCREs tested

1.1.3 Output

Output figures for the cCRE-BC association analyses, generated by the pipeline, are presented in the chapter: 2. Associations QC

1.2 RNA and DNA quantification

1.2.1 Script

activity_analysis.py

1.2.2 Input

activity_df: Each row represents a cCRE and includes its activity data: the activity statistic, p-value, and RNA/DNA counts. This is the key file for the activity chapter.

| cCRE | DNA_rep_comb | RNA_rep_comb | activity_status | RNA_DNA_ratio_log_rep_comb | activity_pval | activity_statistic | activity_FDR |
|---|---|---|---|---|---|---|---|
| [str] | [float64] | [float64] | [str], Allowed values: ‘non_active’/‘active’ | [float64] | [float64] | [float64] | [float64] |
| cCRE ID 1 | DNA read count across all replicates | RNA read count across all replicates | cCRE activity | log RNA/DNA ratio across all replicates | Statistic score p-value | activity statistic | adjusted p-value after FDR |
| cCRE ID 2 | DNA read count across all replicates | RNA read count across all replicates | cCRE activity | log RNA/DNA ratio across all replicates | Statistic score p-value | activity statistic | adjusted p-value after FDR |
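
Because activity_status accepts only two values, a quick scan for anything else can catch labeling mistakes early. This is a minimal sketch, assuming a tab-separated file with a header row; `invalid_status_rows` is a hypothetical helper, not a pipeline function.

```python
import csv
import io

ALLOWED_STATUS = {"active", "non_active"}


def invalid_status_rows(handle, delimiter="\t"):
    """Return (cCRE, activity_status) pairs whose status is not an allowed value.

    Hypothetical helper; the tab delimiter is an assumption.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        (row["cCRE"], row["activity_status"])
        for row in reader
        if row["activity_status"] not in ALLOWED_STATUS
    ]


# Toy data: the second row uses a label that is not allowed.
example = io.StringIO(
    "cCRE\tactivity_status\tactivity_FDR\n"
    "cCRE_1\tactive\t0.01\n"
    "cCRE_2\tinactive\t0.40\n"
)
bad = invalid_status_rows(example)
```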



activity_per_rep: RNA and DNA read count data for each cCRE, per replicate and combined

| cCRE | RNA_rep1 | DNA_rep1 | RNA_rep2 | DNA_rep2 | RNA_rep3 | DNA_rep3 | RNA_DNA_ratio_log_rep1 | RNA_DNA_ratio_log_rep2 | RNA_DNA_ratio_log_rep3 |
|---|---|---|---|---|---|---|---|---|---|
| [str] | [object], a list of integers | [object], a list of integers | [object], a list of integers | [object], a list of integers | [object], a list of integers | [object], a list of integers | [float64] | [float64] | [float64] |
| cCRE ID 1 | RNA reads rep1 | DNA reads rep1 | RNA reads rep2 | DNA reads rep2 | RNA reads rep3 | DNA reads rep3 | RNA/DNA log ratio rep 1 | RNA/DNA log ratio rep 2 | RNA/DNA log ratio rep 3 |
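
The per-replicate read columns hold lists of integers. If the file is stored as text, each list is likely serialized as a string such as "[12, 0, 7]" and needs to be parsed back. The sketch below assumes that Python-literal serialization, which this guide does not confirm; `parse_int_list` is a hypothetical helper.

```python
import ast


def parse_int_list(cell):
    """Convert a text cell such as "[12, 0, 7]" into a list of integers.

    Hypothetical helper; assumes lists are written as Python-style literals.
    Raises ValueError if the cell is not a list of integers.
    """
    try:
        value = ast.literal_eval(cell)
    except (SyntaxError, ValueError):
        raise ValueError(f"cannot parse cell: {cell!r}") from None
    is_int_list = isinstance(value, list) and all(
        isinstance(v, int) and not isinstance(v, bool) for v in value
    )
    if not is_int_list:
        raise ValueError(f"not a list of integers: {cell!r}")
    return value


# Toy cell value for illustration only.
counts = parse_int_list("[12, 0, 7]")
total_reads = sum(counts)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for cells read from a file.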



cCRE_fasta: Fasta file that includes all cCREs tested



different_std_threshold_analysis: DNA and RNA counts after outlier filtering at several levels of stringency.

| ratio_log_{outlier_filter}_{rep} | DNA_{outlier_filter}sum{rep} |
|---|---|
| [float64] | [float64] |
| RNA/DNA ratio for each outlier filter parameter - replicate pair | DNA read count for each outlier filter parameter - replicate pair |
| RNA/DNA ratio for each outlier filter parameter - replicate pair | DNA read count for each outlier filter parameter - replicate pair |



screen_df: Overlap of the cCRE library with the ENCODE SCREEN database of regulatory elements. Each row represents a cCRE and must carry one of the following SCREEN annotations: Distal enhancer like sequence, DNase-only, Proximal enhancer like sequence, Heterochromatin, Promoter like sequence, DNase-H3K4me3. This file can be created using bedtools.

| activity_status | activity_statistic | class |
|---|---|---|
| [str], ‘non_active’/‘active’ | [float64] | [str], Allowed values: ‘Proximal Enhancer’/‘Distal Enhancer’/‘Promoter’/‘Heterochromatin’/‘DNase-only’/‘DNase-H3K4me3’ |
| cCRE activity | cCRE activity statistic | cCRE screen class overlap |
| cCRE activity | cCRE activity statistic | cCRE screen class overlap |



tss_df: Distance of each cCRE from the nearest TSS. Each row must include a numeric value representing that distance. This file can be created using bedtools.

| activity_status | activity_statistic | log10_distance |
|---|---|---|
| [str], Allowed values: ‘non_active’/‘active’ | [float64] | [float64] |
| cCRE activity | cCRE activity statistic | cCRE distance from nearest TSS, log10 |
| cCRE activity | cCRE activity statistic | cCRE distance from nearest TSS, log10 |



AI_df: Comparison of MPRA activity data with an AI model's predictions for the same cCREs

| cCRE | exp: MPRA_activity | AI: predicted_activity |
|---|---|---|
| [str] | [float64] | [float64] |
| cCRE ID 1 | Experimental activity statistic | AI-predicted activity statistic |
| cCRE ID 2 | Experimental activity statistic | AI-predicted activity statistic |



AI_comparative_df: Same as AI_df, but for differential activity

| id | LFC - exp | LFC - AI |
|---|---|---|
| [str] | [float64] | [float64] |
| cCRE ID 1 | Experimental log fold change derived/ancestral | AI-predicted log fold change derived/ancestral |
| cCRE ID 2 | Experimental log fold change derived/ancestral | AI-predicted log fold change derived/ancestral |



downsampling_activity_path: A path to a folder that contains an activity_df file for each sampling parameter



downsampling_ratio_path: A path to a folder that contains an activity_per_rep file for each sampling parameter



comparative_df: MPRA comparative results, each row represents a locus

| seq_id | logFC | differentialy_active | differential_activity_FDR |
|---|---|---|---|
| [str] | [float64] | [bool] | [float64] |
| cCRE ID 1 | logFC between the derived and ancestral alleles | differential activity status | p-value after FDR |
| cCRE ID 2 | logFC between the derived and ancestral alleles | differential activity status | p-value after FDR |



allelic_pairs_df: MPRA quantitative data, each row represents a locus and includes data for both alleles of the locus

| cCRE | allele1 | allele2 |
|---|---|---|
| [str] | [float64] | [float64] |
| cCRE ID 1 | Activity statistic of allele 1 | Activity statistic of allele 2 |
| cCRE ID 2 | Activity statistic of allele 1 | Activity statistic of allele 2 |



cell_types_df: MPRA quantitative data, each row represents a cCRE and includes data for two different cell types

| seq_id | RNA_DNA_ratio_log_cell1 | RNA_DNA_ratio_log_cell2 |
|---|---|---|
| [str] | [float64] | [float64] |
| cCRE ID 1 | RNA/DNA log ratio in cell type 1 | RNA/DNA log ratio in cell type 2 |
| cCRE ID 2 | RNA/DNA log ratio in cell type 1 | RNA/DNA log ratio in cell type 2 |



allelic_pairs_replicates_df: log2 RNA/DNA data for each locus, covering both alleles and their logFC

| seq_id | lfc_rep1 | lfc_rep2 |
|---|---|---|
| [str] | [float64] | [float64] |
| cCRE ID 1 | Log fold change derived/ancestral rep1 | Log fold change derived/ancestral rep2 |
| cCRE ID 2 | Log fold change derived/ancestral rep1 | Log fold change derived/ancestral rep2 |



control_df: control annotation for each cCRE

| cCRE | cCRE type |
|---|---|
| [str] | [str], Allowed values: ‘positive’/‘negative’/‘test’ |
| cCRE ID 1 | cCRE annotation |
| cCRE ID 2 | cCRE annotation |



reads_by_group: RNA reads for each cCRE by sample

| cCRE | {Sample} |
|---|---|
| [str] | [str] |
| cCRE ID 1 | cCRE RNA reads in {sample} |
| cCRE ID 2 | cCRE RNA reads in {sample} |



samples_metadata: Group annotation per sample

| Sample | Group |
|---|---|
| [str] | [str] |
| Sample ID 1 | Group annotation for sample 1 |
| Sample ID 2 | Group annotation for sample 2 |
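
The sample columns of reads_by_group and the Sample column of samples_metadata are linked by sample ID. The sketch below shows one way the two inputs could be combined, summing RNA reads per group for each cCRE. How activity_analysis.py actually uses these files is not described here, so the aggregation and the name `reads_per_group` are illustrative assumptions.

```python
from collections import defaultdict


def reads_per_group(reads_by_sample, sample_to_group):
    """Sum RNA reads per group for each cCRE.

    reads_by_sample: {cCRE: {sample: reads}}, reads given as text or numbers.
    sample_to_group: {sample: group}, as in samples_metadata.
    Hypothetical helper; summation is an assumption made for illustration.
    """
    totals = {}
    for ccre, per_sample in reads_by_sample.items():
        group_totals = defaultdict(int)
        for sample, reads in per_sample.items():
            # Read counts are listed as [str] in the format table, so convert them.
            group_totals[sample_to_group[sample]] += int(reads)
        totals[ccre] = dict(group_totals)
    return totals


# Toy data for illustration only.
reads = {"cCRE_1": {"s1": "10", "s2": "5", "s3": "2"}}
groups = {"s1": "A", "s2": "A", "s3": "B"}
summary = reads_per_group(reads, groups)
```

A sample that appears in reads_by_group but not in samples_metadata raises a KeyError here, which makes mismatched sample IDs easy to spot.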

1.2.3 Output

Output figures for the RNA and DNA quantification analyses, generated by the pipeline, are presented in the chapter: 3. Activity QC