This directory contains BED files of merged polyA sites generated within the GENCODE Capture Long-Seq project using PacBio reads.

These were produced by merging "*polyAsitesNoErcc.bed" files listed in https://public_docs.crg.es/rguigo/Papers/2017_lagarde-uszczynska_CLS/data/polyA/raw/ using the following command:

$ cat $RAW_BEDfile | bedtools merge -s -n -d 5 -nms -i stdin | awk '$5>1' | perl -F"\t" -lane 'if($F[5] eq "+"){$F[1]=$F[2]-1}elsif($F[5] eq "-"){$F[2]=$F[1]+1}else{die} print join("\t",@F);'|sortbed > <species>All_Cap1_<tissue>_<subset>.clusters.bed
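
The effect of the last two filtering steps can be illustrated on toy data. The sketch below is not the original pipeline: it re-implements the `awk '$5>1'` filter and the perl coordinate adjustment in a single awk call, and the function name `collapse_sites` and the input records are made up for illustration. It assumes the column order produced by the merge step (chromosome, start, end, read names, read count, strand).

```shell
# Hypothetical helper: mimics the '$5>1' filter and the perl step of the command above.
collapse_sites() {
  awk -F'\t' -v OFS='\t' '
    $5 > 1 {                                  # keep clusters supported by more than one read
      if ($6 == "+")      { $2 = $3 - 1 }     # plus strand: keep only the last base of the cluster
      else if ($6 == "-") { $3 = $2 + 1 }     # minus strand: keep only the first base of the cluster
      else { print "unexpected strand: " $6 > "/dev/stderr"; exit 1 }
      print
    }'
}

# Made-up merged clusters; the third one has a single read and is dropped.
printf 'chr1\t100\t110\tr1,r2\t2\t+\nchr1\t200\t208\tr3,r4,r5\t3\t-\nchr2\t50\t51\tr6\t1\t+\n' \
  | collapse_sites
```

On this input the plus-strand cluster becomes chr1:109-110 and the minus-strand cluster becomes chr1:200-201, so every reported site is one nucleotide wide and marks the 3' end of the cluster.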


All files use genome assembly hg38 (human) or mm10 (mouse) and are derived from PacBio reads unless otherwise stated.


# File naming scheme:

<species>All_Cap1_<tissue>_<subset>.clusters.bed

where:
	species:
		"mm": mouse
		"hs": human

	tissue: self-explanatory, except:
		"all": all polyA sites merged across all tissues.

	subset:
		"polyAsitesNoErcc": all polyA sites, excluding those called on ERCC spike-in sequences.
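
As a minimal sketch of how the naming scheme can be taken apart in a script, the function below splits a file name into its three fields using shell parameter expansion. The function name `parse_name` and the example file name are hypothetical, and the sketch assumes that the subset field contains no underscore.

```shell
# Hypothetical helper: prints species, tissue and subset as tab-separated fields.
parse_name() {
  base=${1##*/}                  # drop any leading directory
  base=${base%.bed}              # drop the .bed extension
  base=${base%.clusters}         # drop the .clusters suffix, if present
  species=${base%%All_Cap1_*}    # text before "All_Cap1_"
  rest=${base#*All_Cap1_}        # text after "All_Cap1_"
  tissue=${rest%_*}              # everything before the last underscore
  subset=${rest##*_}             # everything after the last underscore
  printf '%s\t%s\t%s\n' "$species" "$tissue" "$subset"
}

# Example with a made-up file name:
parse_name hsAll_Cap1_Brain_polyAsitesNoErcc.clusters.bed
```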


# BED file format (BED6):

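There is one merged polyA site per BED record. Given the merging command above, each site is supported by at least two reads and spans a single nucleotide.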

    column 1: chromosome
    column 2: start coordinate of the polyA site (0-based, as in standard BED)
    column 3: end coordinate of the polyA site
    column 4: comma-separated list of read identifiers contributing to the site
    column 5: number of reads contributing to the site
    column 6: genomic strand of the site
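
As a rough illustration of how these columns can be used, the sketch below counts sites and supporting reads per strand. The function name `summarize_sites` and the input records are made up; it assumes tab-separated input with the six columns described above.

```shell
# Hypothetical helper: per strand, print the number of sites and the total number of supporting reads.
summarize_sites() {
  awk -F'\t' '
    { sites[$6]++; reads[$6] += $5 }          # column 6: strand, column 5: read count
    END { for (s in sites) printf "%s\t%d\t%d\n", s, sites[s], reads[s] }
  ' "$@" | LC_ALL=C sort                      # fixed byte order, so "+" is listed before "-"
}

# Made-up records in the format described above:
printf 'chr1\t109\t110\tr1,r2\t2\t+\nchr1\t200\t201\tr3,r4,r5\t3\t-\nchr2\t74\t75\tr7,r8\t2\t+\n' \
  | summarize_sites
```

The function also accepts file names as arguments, for example `summarize_sites some_file.clusters.bed`.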