Ryan's Blog

Matplotlib: black background

Posted in research by ryanlayer on February 18, 2014

matplotlib.rcParams.update({'font.size': 12})
fig = matplotlib.pyplot.figure(figsize=(5,5),dpi=300,facecolor='black')
ax = fig.add_subplot(1,1,1,axisbg='k')


ax.tick_params(axis='x', colors='white')
ax.tick_params(axis='y', colors='white')

ax.tick_params(axis='both', direction='in')

ax.set_xlabel('X-Axis Label Here')
ax.set_ylabel('Y-Axis Label Here')
ax.set_title('Title Here')



Here for a Matplotlib color names.

Tagged with: , ,

General Papers

Posted in research by ryanlayer on November 22, 2010

Next-generation gap



Tagged with:

Local Installation and Use of R Packages

Posted in research by ryanlayer on June 1, 2010


1. Specifying a local library search location

Specify a local library search location.

You can use several library trees of add-on packages. The easiest way to tell R to use these via a ‘dotfile’ by creating the following file ‘$HOME/.Renviron’ (watch the quotes and ~ character):


This specifies a keyword (R_LIBS_USER) which points to a colon-separated list of directories at which R library trees are rooted. You do not have to specify the default tree for R packages.

If necessary, create a place for your R libraries

  mkdir ~/R ~/R/library         # Only need do this once

Set your R library path

  echo 'R_LIBS_USER="~/R/library"' >  $HOME/.Renviron

2. Installing to a local library search location

Installation is dead easy. Start up R and tell R to fetch your package from CRAN, compile whatever needs compiling and set everything else up.

Beware – each package will only work for the platform (i.e. Linux or Solaris) where you installed it. If you want a package on both Linux and Solaris, you’ll need to install it in different directories for each system type.

  R                     # Invoke R
  > install.packages("name-of-your-package",lib="~/R/library")
Tagged with:

Novoalign Alignment Scores

Posted in research by ryanlayer on May 31, 2010

Base Qualities and Alignment Scores

Novoalign aligns reads against a reference genome using qualities and ambiguous nucleotide codes.
The initial alignment process finds alignment locations in the indexed sequence that are possible
sources of the read sequence. The alignment locations are scored using the Needleman­Wunsch
algorithm with affine gap penalties and with position specific scoring derived from the read base
qualities and any ambiguous codes in the reference sequence. User defined affine gap penalties are
used for scoring insert/deletes.
Novoalign uses Needleman­Wunsch alignments with affine gap penalties, the gap opening penalty
should be set to -10\log_{10}(P_{gap}) - G_{extend} where P_{gap} is the probability of an insertion deletion
mutation vs the reference genome and G_{extend} is the gap extension penalty. Likewise the gap extend
penalty can be set to -10\log_{10}(P_{gap2} / P_{gap1}) where P_{gap1} is the probability of a single base indel
and P_{gap2} is the probability of a 2 base insert/delete mutation. The default gap penalties were
derived from the frequency of short insert/deletes in human genome resequencing projects.
Base quality values are used to calculate base penalties for the Needleman ­Wunsch algorithm. The
base qualities are converted to base probabilities and then to score penalties.

PRB Quality to Score Conversion

The prb file has quality score Q(b,i) for each base, b, at each position, i, in the read. The quality
value is converted to a probability, Pr(b,i) and then to a penalty P(b, i).

Pr(b,i) = \frac{10^{\frac{Q(b,i)}{10}}}{1 + 10^{\frac{Q(b,i)}{10}}}

P(b,i) = -10\log_{10}(Pr(b,i))

Alignment Score and Threshold

The alignment score is -10\log_{10}(P(R|A_i)) where P(R | A_i) is the probability of the read sequence
given the alignment location i.

A threshold of 75 would allow for alignment of reads with two mismatches at high quality base
positions plus one or two mismatches at low quality positions or to ambiguous characters in the
reference sequence.
If a threshold is not specified then Novoalign will calculate a threshold for each read such that an
alignment to a non­repetitive sequence will have an alignment quality of at least 20. I.e. The
iterative process of finding an alignment will terminate before finding a low quality chance
alignment. Alignments to repetitive sequences may still have qualities less than 20.

Posterior Alignment Probabilities and Quality Scores

The posterior alignment probability calculation includes all the alignments found; the probability
that the read came from a repeat masked region or from any regions coded in the reference genome
as N’s; and an allowance for a chance hit above the threshold based on the mutual information
content of the read and the genome.
A posterior alignment probability, P(A_i| R, G) is calculated as:

P(A_i| R, G) = \frac{P(R|A_i, G)}{P(R|N, G) + \sum P(R|A_i,B)}

where P(R|N,G) is the probability of finding the read by chance in any masked reference sequence
or   any   region   of   the   reference   sequence   coded   as   N‘s,   and   where  \sum i is the sum over all the
alignments found plus a factor for chance alignments calculated using the usable read and genome
The P(R|N,G) term allows for the fact that a fragment could have been sourced from portions of the
genome that are not represented in the reference sequence. For instance in Human genome build 36
there is approximately 7% of sequence represented by large blocks of N‘s.
A quality score is calculated as -10\log_{10}(1 - P(A_i | R, G)), where P(A_i|R, G) is the probability of the
alignment given the read and the genome.

Tagged with: ,

Mus musculus (laboratory mouse) Chromosome

Posted in research, Uncategorized by ryanlayer on May 31, 2010
GenBank id chr length
NC_000067 chr1 197195432
NC_000068 chr2 181748087
NC_000069 chr3 159599783
NC_000070 chr4 155630120
NC_000071 chr5 152537259
NC_000072 chr6 149517037
NC_000073 chr7 152524553
NC_000074 chr8 131738871
NC_000075 chr9 124076172
NC_000076 chr10 129993255
NC_000077 chr11 121843856
NC_000078 chr12 121257530
NC_000079 chr13 120284312
NC_000080 chr14 125194864
NC_000081 chr15 103494974
NC_000082 chr16 98319150
NC_000083 chr17 95272651
NC_000084 chr18 90772031
NC_000085 chr19 61342430
NC_000086 chrX 166650296
NC_000087 chrY 15902555
Tagged with:

BED File Format

Posted in research by ryanlayer on January 14, 2010

BED format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.

If your data set is BED-like, but it is very large and you would like to keep it on your own server, you should use the bigBed data format.

The first three required BED fields are:

  1. chrom – The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  2. chromStart – The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd – The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

The 9 additional optional BED fields are:

  1. name – Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
  2. score – A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). This table shows the Genome Browser’s translation of BED score values into shades of gray:
    score in range ≤ 166 167-277 278-388 389-499 500-611 612-722 723-833 834-944 ≥ 945
  3. strand – Defines the strand – either ‘+’ or ‘-‘.
  4. thickStart – The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
  5. thickEnd – The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
  6. itemRgb – An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to “On”, this RBG value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
  7. blockCount – The number of blocks (exons) in the BED line.
  8. blockSizes – A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
  9. blockStarts – A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

See Example 3 for a demonstration of a custom track that uses a complete BED12 definition.

Example 5:
This example shows an annotation track that uses the itemRgb attribute to individually color each data line. In this track, the color scheme distinguishes between items named “Pos*” and those named “Neg*”. See the usage note in the itemRgb description above for color palette restrictions. NOTE: The track and data lines in this example have been reformated for documentation purposes. Click here for a copy of this example that can be pasted into the browser without editing.

browser position chr7:127471196-127495720
browser hide all
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2
chr7	127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
chr7	127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
chr7	127473530  127474697  Pos3  0  +  127473530  127474697  255,0,0
chr7	127474697  127475864  Pos4  0  +  127474697  127475864  255,0,0
chr7	127475864  127477031  Neg1  0  -  127475864  127477031  0,0,255
chr7	127477031  127478198  Neg2  0  -  127477031  127478198  0,0,255
chr7	127478198  127479365  Neg3  0  -  127478198  127479365  0,0,255
chr7	127479365  127480532  Pos5  0  +  127479365  127480532  255,0,0
chr7	127480532  127481699  Neg4  0  -  127480532  127481699  0,0,255


Tagged with: ,

Getting Sizes Right for CUDA

Posted in programming, research by ryanlayer on January 12, 2010

Block Elements:  this depends on the size of the block’s local memory (LOCAL_MEM) and the size of each element in the data structure (sizeof(int)).  For the Tesla card there is 16K of space so LOCAL_MEM=16384.  16384/4 = 4096

Block Size (number of threads per block):  Each thread is responsible for one window.  The number of threads per block depends on the block elements, the window size, and the slide size.  (block_elements – window_length) / slide_length. (4096 – 200)/50 = 77

Usable Threads:  Each thread is responsible for loading the first slide_length of its range from global memory to local memory.  This means that some threads at the end of the block will not have all of their data load.  There will be window_length/slide_length unusable threads, and block_size – (window_length/slide_length) usable threads.

NOTE:  blockDim.x and window_length/slide refer to number of windows (how many windows in a block, and how many unusable windows/threads per block).  To convert to position we often must multiply the number of windows by the slide_length.  For example, window 5 will start at position 5*slide_length.

Each block starts at position blockIdx.x*(blockDim.x – window_length/slide_length)*slide_length

blockIdx.x is the block ID, if blocks did not need to overlap, then we would just multiply this by blockDim.x*slide_length (blockDim.x refers to the number of windows, and we need position, so we multiply by slide_length).  Since things overlap, we need each block (after the first one) to start a few positions back.  The number of unusable windows at the end of each block is equal to window_length/slide_length.  The next block needs to cover these windows.  Block sizes are fixed, so blocks that are moved back to cover unused windows will leave some amount of unprocessed windows that must be covered by the next block (in addition to the unusable windows).  Block 1 needs to be moved back window_length/slide_length to cover the unusable windows in block 0; block 2 needs to be moved back 2*(window_length/slide_length) to cover both the unprocessed space and the unusable windows; block 3 needs to be moved back 3*(window_length/slide_length); and so forth.  The amount a block must be moved back is blockIdx.x*window_length/slide_length, and therefore each block starts at  blockIdx.x(blockDim.x – window_length/slide_length)slide_length.

Each thread, which corresponds to a window, starts at an offset from where the block starts, that offset is based on the slide size: threadIdx.x*slide_length + blockIdx.x(blockDim.x – window_length/slide_length)slide_length

Number of Blocks:  block_elements/chrom_size would be correct if there was no overlapping, but blocks must overlap to account for the unusable threads

Structural Variation Graph (sv graph) Thoughts

Posted in programming, research by ryanlayer on January 5, 2010

The data structure consists of:

– a set of chromosomes

– each chromosome is represented by a name, and an ordered (doublely linked) list of nodes

– each node represents one tag of a pair

– each node has a double sided link to the next node in chromosome order (the node with the next largest offset), and a list of nodes in which the node is part of a pair.  The links to the pairs are one-way.

The chromosome name is not part of the node struct.  Each node does have a pointer back to it’s chromosome structure, and that chromosome struct contains the name.  This prevents a potentially long chromosome name from being stored a ton of time (in each node).  When we are reading the nodes from a file, we must pass in a char pointer so that the name of the chromosome can be set.  We will use this char pointer to put the node into the proper chromosome structure.

Tagged with: ,

Reading mapped files and fastq files

Posted in programming, research by ryanlayer on December 2, 2009

When reading the files, we deal with two types, three files in total.  Two of the files (.out files) are the result of mapping the fastq files to the reference.  Each file represents all tags on one side of the pair.  The fastq file can represent either side of the pair.

We assume that all files are ordered.

For each file, we want to extract the entries.

In the mapped file, we ignore any line that begins with ‘#’, or has a status other than ‘U’ which indicates it is a uniq mapping.  While we are reading the mapped files, each line read can have one of three results and will return one of three integer values:

  1. valid entry (1)
  2. invalid entry (status != ‘R’, ^ = ‘#’, etc.) (0)
  3. end of file (-1)

These different return values allow us to loop for the next valid entry and stop looping when we reach the end:

while(read_line(mapped.file) == 0) {;} will loop until we have a valid entry or have reached the end of the file.

In the fastq file, there are 4 lines of data per tag, we are only concerned with the first, so we simply skip three lines between entry reads.

Tagged with: , , , , ,

Papers about malware black markets

Posted in research by ryanlayer on November 9, 2009

Nobody Sells Gold for the Price of Silver:Dishonesty, Uncertainty and the Underground Economy [PDF]

Cormac Herley and Dinei Flor encio
Microsoft Research One Microsoft Way Redmond, WA, USA
c.herley@ieee.org, dinei@microsoft.com

Organised crime groups in cyberspace: a typology [PDF]

Kim-Kwang Raymond Choo
Australian Institute of Criminology, GPO Box 2944, Canberra, ACT 2601, Australia


Botnet Economics: Uncertainty Matters [PDF]

Zhen Li
Department of Economics and Management, Albion College
Qi Liao, Aaron Striegel
Department of Computer Science and Engineering, University of Notre Dame

Learning More about the Underground Economy: A Case-Study of Keyloggers and Dropzone [PDF]

Thorsten Holz, Markus Engelberth, and Felix Freiling
Laboratory for Dependable Distributed Systems, University of Mannheim, Germany
Secure Systems Lab, Vienna University of Technology, Austria

All your iFRAMEs point to Us [PDF]

Niels Provos Panayiotis Mavrommatis Moheeb Abu Rajab Fabian Monrose
Google, JHU

Tagged with: , ,