This directory contains BED files of raw polyA sites generated within the GENCODE Capture Long-Seq project using PacBio reads. These were produced using samToPolyA (https://github.com/julienlag/samToPolyA) on the "*bothAdapters.bam" files listed in https://public_docs.crg.es/rguigo/Papers/2017_lagarde-uszczynska_CLS/data/BAMs/ using the following command:

$ samtools view $BAMfile | samToPolyA.pl --minClipped=20 --minAcontent=0.9 --minUpMisPrimeAlength=10 --genomeFasta=${genome}-erccSpikeins.fa - |sortbed> All_Cap1__polyAsites.bed

All files correspond to genome assemblies hg38 and mm10, and contain PacBio reads unless otherwise stated.

# File naming scheme:

All_Cap1__.bed

where:

species:
    "mm": mouse
    "hs": human

tissue: self-explanatory.

subset:
    "polyAsites": all polyA sites, including those called on ERCC spike-in sequences (these are false positives).
    "polyAsitesNoErcc": all polyA sites, excluding those called on ERCC spike-in sequences. That is, only genomic polyA sites.

# BED file format (BED6):

There is one read per BED record.

column 1: chromosome (or ERCC sequence identifier, if applicable)
column 2: chromosome start of polyA site
column 3: chromosome end of polyA site
column 4: identifier of the read containing a polyA tail
column 5: length of the polyA tail on the read
column 6: genomic strand of the read, inferred from the mapping of the read: reads where a polyA tail was detected at their 3’ end are assigned a ‘+’ genomic strand, whereas reads with a polyT tail at their 5’ end are deduced to originate from the ‘-’ strand.
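
# Example: reading the BED6 records

As an illustration of the column layout described above, the following Python sketch parses BED6 lines into named records and separates genomic sites from those called on ERCC spike-ins. It is not part of the samToPolyA pipeline. The names `PolyASite`, `parse_bed6` and `is_ercc` are hypothetical, the example records are made up, and the code assumes that fields are tab-separated, that column 5 holds an integer, and that ERCC sequence identifiers start with "ERCC".

```python
from typing import Iterable, Iterator, NamedTuple


class PolyASite(NamedTuple):
    """One BED6 record, i.e. one read with a detected polyA tail."""
    chrom: str        # column 1: chromosome or ERCC sequence identifier
    start: int        # column 2: start of polyA site
    end: int          # column 3: end of polyA site
    read_id: str      # column 4: identifier of the read
    tail_length: int  # column 5: length of the polyA tail on the read
    strand: str       # column 6: genomic strand of the read


def parse_bed6(lines: Iterable[str]) -> Iterator[PolyASite]:
    """Yield one PolyASite per non-empty, tab-separated BED6 line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        chrom, start, end, read_id, tail_length, strand = fields[:6]
        yield PolyASite(chrom, int(start), int(end), read_id,
                        int(tail_length), strand)


def is_ercc(site: PolyASite, prefix: str = "ERCC") -> bool:
    """Assumption: ERCC spike-in identifiers begin with the given prefix."""
    return site.chrom.startswith(prefix)


# Made-up records, for illustration only.
example_lines = [
    "chr1\t1000\t1001\tread_a\t25\t+\n",
    "chr2\t5000\t5001\tread_b\t31\t-\n",
    "ERCC-00002\t1060\t1061\tread_c\t22\t+\n",
]

sites = list(parse_bed6(example_lines))
genomic_sites = [s for s in sites if not is_ercc(s)]
```

To read an actual file, an open file handle can be passed in place of the list, for example `parse_bed6(open(path))`, where `path` points to one of the BED files in this directory.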