This directory contains BED files of raw polyA sites generated within the GENCODE Capture Long-Seq project using PacBio reads.

These were produced by running samToPolyA (https://github.com/julienlag/samToPolyA) on the "*bothAdapters.bam"
files listed in https://public_docs.crg.es/rguigo/Papers/2017_lagarde-uszczynska_CLS/data/BAMs/
with the following command:
$ samtools view $BAMfile | samToPolyA.pl --minClipped=20 --minAcontent=0.9 --minUpMisPrimeAlength=10 --genomeFasta=${genome}-erccSpikeins.fa - | sortbed > <species>All_Cap1_<tissue>_polyAsites.bed

All files correspond to genome assemblies hg38 (human) and mm10 (mouse), and contain PacBio reads unless otherwise stated.


# File naming scheme:

<species>All_Cap1_<tissue>_<subset>.bed

where:
	species:
		"mm": mouse
		"hs": human

	tissue: self-explanatory.

	subset:
		"polyAsites": all polyA sites, including those called on ERCC spike-in sequences (these are false positives)
		"polyAsitesNoErcc": all polyA sites, excluding those called on ERCC spike-in sequences. That is, only genomic polyA sites.
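
The naming scheme above can be split into its three fields programmatically. The following is a minimal Python sketch, not part of the original pipeline: the names NAME_RE, FileInfo and parse_name are hypothetical, and the tissue labels used in the usage note are illustrative only.

```python
import re
from typing import NamedTuple

# Species codes as listed in the naming scheme.
SPECIES = {"hs": "human", "mm": "mouse"}

# <species>All_Cap1_<tissue>_<subset>.bed
# Assumption: <subset> is one of the two values documented in this README.
NAME_RE = re.compile(r"(hs|mm)All_Cap1_(.+)_(polyAsitesNoErcc|polyAsites)\.bed")


class FileInfo(NamedTuple):
    species: str  # "human" or "mouse"
    tissue: str   # tissue label, taken verbatim from the file name
    subset: str   # "polyAsites" or "polyAsitesNoErcc"


def parse_name(filename: str) -> FileInfo:
    """Split a file name into (species, tissue, subset)."""
    match = NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a recognised polyA site file name: {filename!r}")
    code, tissue, subset = match.groups()
    return FileInfo(SPECIES[code], tissue, subset)
```

For example, parse_name("hsAll_Cap1_Brain_polyAsites.bed") yields the fields "human", "Brain" and "polyAsites"; a name that does not follow the scheme raises ValueError.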


# BED file format (BED6):

There is one read per BED record.

    column 1: chromosome (or ERCC sequence identifier, if applicable)
    column 2: chromosome start of polyA site
    column 3: chromosome end of polyA site
    column 4: identifier of the read containing a polyA tail
    column 5: length of the polyA tail on read
    column 6: genomic strand of the read (inferred from the mapping of the read,
              i.e. reads where a polyA tail was detected at their 3’ end are
              assigned a ‘+’ genomic strand, whereas reads with a polyT tail at
              their 5’ end are deduced to originate from the ‘-’ strand.)
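
Because there is one record per read, several records can point to the same polyA site. The following Python sketch, which is not part of the original pipeline, reads the six columns described above and counts supporting reads per site. The names PolyARead, parse_bed6 and reads_per_site are hypothetical, and the code assumes tab-separated fields and an integer tail length in column 5.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PolyARead(NamedTuple):
    chrom: str        # column 1: chromosome or ERCC sequence identifier
    start: int        # column 2: start of polyA site
    end: int          # column 3: end of polyA site
    read_id: str      # column 4: read identifier
    tail_length: int  # column 5: polyA tail length (assumed to be an integer)
    strand: str       # column 6: '+' or '-'


def parse_bed6(line: str) -> PolyARead:
    """Parse one BED6 line (assumed tab-separated) into a PolyARead."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    chrom, start, end, read_id, tail_length, strand = fields[:6]
    return PolyARead(chrom, int(start), int(end), read_id, int(tail_length), strand)


def reads_per_site(lines: Iterable[str]) -> Counter:
    """Count reads supporting each (chrom, start, end, strand) site."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        read = parse_bed6(line)
        counts[(read.chrom, read.start, read.end, read.strand)] += 1
    return counts
```

An open file handle can be passed directly, for instance reads_per_site(open(path)) where path points to one of the BED files in this directory. Sites are keyed on exact coordinates and strand; nearby sites are not merged.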