SOCS
SOCS requires three input files: a FASTA-formatted reference genome file and the two files generated by the SOLiD system that contain raw color data and quality scores (xxx.csfasta and xxx_QV.qual, respectively). The program produces a list of best matches for mapped reads and six genome-wide sets of coverage scores as its output. The first pair of coverage maps (one per strand) contains the raw coverage scores for all sequence reads that can be mapped unambiguously, and a second pair contains coverage scores for all reads that match multiple locations within the reference genome. The final pair of files records the valid sequence-space variants (both frequency and identity) found between the sequence data set and the reference genome.
NOTE: Reads should be filtered using filterReads.pl (available in scripts.tar) before running SOCS, or crashes may occur.


Options

Reference chromosome(s) Reference genome to map to, given either as a whole-genome nucleotide FASTA file or as the name of a directory containing files with a ".fasta", ".fna", or ".fa" extension. Each file must contain exactly one FASTA tag, and a maximum of 256 files is allowed.
Color space file FASTA-like file containing color calls for a SOLiD run with primer bases (i.e., XXXX.csfasta). Reads with uncalled colors should be removed using filterReads.pl.
Quality file FASTA-like file containing quality scores in the same order as the color space file (i.e., XXX_QV.qual)
Mismatch tolerance The number of (color space) mismatches a read is allowed to have with the reference genome. To detect sequence-space variants, a value of at least 2 must be used. Note that above a tolerance of 3, runtime increases significantly with each additional mismatch allowed.
Target RAM usage (MB) By varying the number of reads considered at a time, SOCS will attempt to use approximately this much RAM. Using more RAM will generally result in faster mapping. However, using more RAM than is available will cause slow execution. We recommend using a target RAM of 80% of the available RAM.
Threads The program will spawn this many threads to perform the mapping. This can be used to take advantage of multiple cores or processors.
Iterative tolerance If "yes" (default), SOCS will map at lower tolerances before mapping at the specified tolerance. The specific tolerances will be chosen to optimize time with the given configuration. This option reduces runtime, but in rare cases the mapping chosen for a read will not be the optimal one. If "no", SOCS will map only at 0 and the maximum tolerance and will report the number of matches that would not have been optimal with the option on.
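
As a small illustration of the Target RAM usage recommendation above, the following Python sketch computes 80% of a machine's available RAM in megabytes. It is not part of SOCS; the function name and the idea of passing the available RAM in by hand are assumptions made for the example.

```python
def target_ram_mb(available_mb: int, percent: int = 80) -> int:
    """Return the suggested Target RAM usage (MB) as a percentage of available RAM.

    Integer arithmetic is used so the result is always a whole number of MB.
    """
    if available_mb <= 0:
        raise ValueError("available_mb must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return available_mb * percent // 100


if __name__ == "__main__":
    # Example: a machine with 16000 MB of available RAM.
    print(target_ram_mb(16000))
```

The resulting value would then be entered at the Target RAM usage (MB) prompt.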

Output

[Reference file or dir].stats.csv A summary of reads mapped and mismatches
[Reference file or dir].best A binary file used by SOCS for scoring. This can be used to generate the map, amb, and vmm files without mapping again by running SOCS with the "--score" argument. The file can be discarded if no longer needed.
[Reference file or dir].best.txt Best match descriptions for each mapped read. The format of each line is:

read number, chromosome, position, strand, mismatches

where read number is the order in which the read appears in the Color Space File (0-indexed), chromosome is an index keyed to file names at the top of the file, position is the smallest endpoint value regardless of strand (0-indexed), and mismatches is the number of color space mismatches in this match. If a read mapped ambiguously (to identical genomic substrings), all matches will be listed on separate lines.
[Chromosome file].[+/-].map Unambiguous coverage scores for each strand of each chromosome; one line per base
[Chromosome file].[+/-].amb Ambiguous coverage scores for each strand of each chromosome; one line per base
[Chromosome file].[+/-].vmm Valid sequence-space mismatches detected for each strand of each chromosome; one line per base; columns in A,C,G,T order. For example, "0 3 37 0" means that of the reads mapped, 3 had color space sequences signifying C at this base and 37 had color space sequences signifying G. To reduce noise from sequencing errors, only variants whose length is less than the mismatch tolerance are considered. Note that these are analogous to the "valid adjacent mismatches" in a SOLiD GFF file. Ambiguously mapped reads will add mismatch scores to all matching regions.
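
The text output formats described above can be read with a few lines of Python. The sketch below is illustrative and not part of SOCS. It assumes that .best.txt records are comma-separated in the field order shown above, that the strand field can be kept as plain text, and that .vmm lines are whitespace-separated as in the "0 3 37 0" example. The function names and the sample record are hypothetical, and the file-name header at the top of the .best.txt file is not handled here.

```python
from typing import Dict, NamedTuple


class BestMatch(NamedTuple):
    read_number: int  # 0-indexed order of the read in the color space file
    chromosome: int   # index keyed to the file names listed at the top of the file
    position: int     # smallest endpoint value, 0-indexed
    strand: str       # kept as text; the exact encoding is an assumption
    mismatches: int   # number of color space mismatches


def parse_best_line(line: str) -> BestMatch:
    """Parse one best-match record, assuming comma-separated fields."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    read_number, chromosome, position, strand, mismatches = fields
    return BestMatch(int(read_number), int(chromosome), int(position),
                     strand, int(mismatches))


def parse_vmm_line(line: str) -> Dict[str, int]:
    """Parse one .vmm line into per-base counts (columns in A,C,G,T order)."""
    counts = [int(value) for value in line.split()]
    if len(counts) != 4:
        raise ValueError(f"expected 4 counts, got {len(counts)}: {line!r}")
    return dict(zip("ACGT", counts))


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    print(parse_best_line("12, 0, 1045, +, 2"))
    # The example line from the .vmm description above.
    print(parse_vmm_line("0 3 37 0"))
```

Because ambiguously mapped reads appear on several lines, a consumer of .best.txt should expect the same read number to occur more than once.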


Preference files

SOCS saves its options in socs.pref, which is created in the directory that SOCS is run in. This allows options for different data sets to be retained by running SOCS in the corresponding directories.

For batch runs, prompting can be bypassed by calling "socs [options.pref]", where options.pref lists the following options on separate lines:
Reference chromosome(s) (same as above)
Color space file (same as above)
Quality file (same as above)
Output base name Statistics and best-match files will append their extensions to this name.
Mismatch tolerance (same as above)
Target RAM usage (MB) (same as above)
Threads (same as above)
Iterative tolerance (same as above; "false" for yes and "true" for no)
Compute scores If "true", score files will be written. If "false", the binary file [output base name].best will be written. This file can be used by running "socs [options.pref] --score".
Read index offset When best-match files are written, read indices will start at this value. This can be used for distributed processing.
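
A preference file for a batch run could be generated with a short script such as the Python sketch below. It only reflects what is stated above: one option per line, in the listed order, with "false" meaning yes for Iterative tolerance. The exact value syntax SOCS expects (for example, how numbers and paths are written) is an assumption, and the function name and file names are hypothetical.

```python
from typing import List


def build_pref_lines(reference: str, csfasta: str, qual: str, output_base: str,
                     mismatches: int, ram_mb: int, threads: int,
                     iterative: bool = True, compute_scores: bool = True,
                     read_offset: int = 0) -> List[str]:
    """Return the option values in the order listed for options.pref."""
    return [
        reference,                          # Reference chromosome(s)
        csfasta,                            # Color space file
        qual,                               # Quality file
        output_base,                        # Output base name
        str(mismatches),                    # Mismatch tolerance
        str(ram_mb),                        # Target RAM usage (MB)
        str(threads),                       # Threads
        "false" if iterative else "true",   # per the list above: "false" means yes
        "true" if compute_scores else "false",  # Compute scores
        str(read_offset),                   # Read index offset
    ]


if __name__ == "__main__":
    lines = build_pref_lines("ref.fasta", "reads.csfasta", "reads_QV.qual",
                             "run1", mismatches=2, ram_mb=4000, threads=4)
    # Write one option per line (assumed layout).
    with open("options.pref", "w") as handle:
        handle.write("\n".join(lines) + "\n")
```

The resulting file would then be passed on the command line as "socs options.pref".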


Distributed processing

SOCS has been tested on an 8 node x 8 core cluster using human genome sequence data mapped to build 36.3 (the color space file and quality file used are hosted at NCBI).

Each node must receive a fraction of the reads to map. Each instance of SOCS will create .best files in addition to the .best.txt files. If coverage maps are desired, these must be combined to perform scoring (which is single-threaded). The following scripts are provided to facilitate distributed processing (although additional scripting may be required depending on the platform being used):
split.pl Splits the .csfasta and .qual files into a specified number of parts and creates corresponding preference files to be given to each instance of SOCS.
combine.pl Combines the .best and .best.txt files produced by each instance of SOCS and produces a preference file for scoring.
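
To illustrate how the Read index offset option fits into a distributed run, the Python sketch below divides a number of reads into nearly equal parts and reports the starting read index and read count for each part. It is not a description of what split.pl does internally; the function name and the even-split strategy are assumptions made for the example.

```python
from typing import List, Tuple


def plan_split(total_reads: int, parts: int) -> List[Tuple[int, int]]:
    """Return (read_index_offset, read_count) for each part.

    Reads are divided as evenly as possible; the first few parts receive
    one extra read when total_reads is not a multiple of parts.
    """
    if parts <= 0:
        raise ValueError("parts must be positive")
    if total_reads < 0:
        raise ValueError("total_reads must be non-negative")
    base, extra = divmod(total_reads, parts)
    plan = []
    offset = 0
    for index in range(parts):
        count = base + (1 if index < extra else 0)
        plan.append((offset, count))
        offset += count
    return plan


if __name__ == "__main__":
    # Example: 10 reads spread over 3 SOCS instances.
    for offset, count in plan_split(10, 3):
        print(f"offset={offset} reads={count}")
```

Giving each instance its own offset keeps the read numbers in the per-node .best.txt files consistent with the order of the original color space file.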