GP3xCLI: GenePix Post-Processing Program for Quality Assessment of Raw Microarray Data from CSIRO Livestock Industries

Antonio Reverter and Christina Pavlov
Bioinformatics Group, CSIRO Livestock Industries, Queensland Bioscience Precinct
306 Carmody Rd, St Lucia, QLD 4067, Australia

ABSTRACT

We present GP3xCLI, an automated, unsupervised AWK-based script to assess the quality of raw microarray data captured using the GenePix optical scanner. Input files are processed individually, and a 2-page portable document format (pdf) report is generated for each. Although the interpreted AWK programming language is the main driver for filtering and manipulating the raw data, GP3xCLI incorporates tools such as A2PS (a general-purpose postscript generating utility), GNUPLOT (an interactive plotting utility), and PS2PDF (a public domain postscript to pdf converter). On execution, GP3xCLI reports a series of summary statistics, including the total number of spots, anomalies due to background expression being larger than foreground, and the distribution of records by genes or open reading frames.

Inaccurate microarray signals are further scrutinized by means of the percentage of data that is retained after each successive mean to median correlation elimination, as well as by the joint distribution of intensity ratios and average intensities. Finally, diagnostic plots, including the empirical densities of dye channel intensities and intensity ratios, are produced to help distinguish among readings of differing quality. GP3xCLI is intended to be installed on the server hosting the laboratory database, where users can invoke it remotely. Similar to GP3, an existing PERL-based program available at http://www.bch.msu.edu/~zacharet/microarray/GP3.html, GP3xCLI is not designed to process data for subsequent analysis, but rather to provide biologists with a simple, intuitive and effective means of assessing microarray data quality.

AWK Script:

echo " =-=-=-=-=-=-= INITIALIZATION =-=-=-=-=-=-=-="

filename=`ls -l $1 | awk '{print $NF}'`
echo "GPR Input:" $filename | awk '{print $1, $2, " ", $3}'
date | awk '{print "Processed on:", $1, $2, $3, $4, $5, $6}'
sed 's/\"//g' $1 | \
awk 'NF==43 && $1==int($1) && $2==int($2) && $3==int($3) \
     {print $0}' > tempo0

echo " =-=-=-=-=-=-= IMAGE QUALITY =-=-=-=-=-=-=-="

T=`wc tempo0 | awk '{print $1}'`
echo "Total No. of Spots ------------------------>" $T
N=`awk '$NF==-50 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Spots with Flag = -50 -------------------->" $N
N=`awk '$NF==-100 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Spots with Flag = -100 -------------------->" $N
N=`awk '$12>=$9 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Red dye with Background >= Foreground --->" $N
N=`awk '$21>=$18 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Green dye with Background >= Foreground --->" $N

echo " =-=-=-= MEAN TO MEDIAN CORRELATION =-=-=-="

awk '{print $9, $10, $18, $19, log($9)/log(2), log($10)/log(2), \
     log($18)/log(2), log($19)/log(2)}' tempo0 > rg
awk '$1>$2 {$9=$2/$1}; $1<=$2 {$9=$1/$2}; {print $9}' rg > rr
awk '$3>$4 {$9=$4/$3}; $3<=$4 {$9=$3/$4}; {print $9}' rg > gr
awk '$5>$6 {$9=$6/$5}; $5<=$6 {$9=$5/$6}; {print $9}' rg > rl
awk '$7>$8 {$9=$8/$7}; $7<=$8 {$9=$7/$8}; {print $9}' rg > gl
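The ratio written to rr, gr, rl and gl is the smaller of each pair divided by the larger, so it equals 1 when the two values agree and shrinks as they diverge. As a minimal sketch of that rule, the example below applies it to three made-up pairs; the values are illustrative only and do not come from a real GPR file.

```shell
# Hypothetical pairs of intensities for three spots (not real GPR values).
# For each pair, divide the smaller value by the larger one.
ratios=$(printf '%s\n' "1000 1000" "1000 500" "200 800" |
  awk '{
    if ($1 > $2) r = $2 / $1
    else         r = $1 / $2
    printf "%.2f\n", r
  }')
echo "$ratios"
```

The three pairs give 1.00, 0.50 and 0.25, so only the first spot would survive a threshold of 0.9.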

##############################################################
#                          GP3xCLI                           #
#  GenePix Processing Program by CSIRO Livestock Industries  #
#                                                            #
#  Enquiries: [email protected]                              #
#  Copyright (c) 2003 CSIRO-LI                               #
##############################################################

GPR Input: F12.gpr
Processed on: Tue Apr 8 13:40:01 EST 2003

=-=-=-=-=-=-= IMAGE QUALITY =-=-=-=-=-=-=-=
Total No. of Spots ------------------------> 19200
Spots with Flag = -50 --------------------> 4720
Spots with Flag = -100 --------------------> 12
Red dye with Background >= Foreground ---> 892
Green dye with Background >= Foreground ---> 915

Median to Mean Correlation Analysis:
                  DATA LEFT
              RED            GREEN
Corr       Raw    Log2     Raw    Log2
______________________________________
> 0.00   19200   19200   19200   19200
> 0.20   19199   19200   19199   19200
> 0.40   19183   19200   19192   19200
> 0.60   19008   19200   19102   19200
> 0.80   17061   19199   18541   19198
> 0.85   14466   19193   17872   19196
> 0.90   10491   19137   15786   19181

=-=-=-=-=-=-= VALID SPOTS* =-=-=-=-=-=-=-=
Total No. of Valid Spots -----------------> 14433
Percentage of Valid Spots -----------------> 75.2

Total No. of Genes ------------------------> 7220
Mean No. Repetitions -----> 2 for 6600 Genes
Min. No. Repetitions -----> 1 for 580 Genes
Max. No. Repetitions -----> 24 for 8 Genes

               Log(R/G)  vs  0.5*Log(R*G)
               ________      ____________
         N      14433            14433
      Mean     -0.017           10.327
       Std      0.617            2.079
       Min     -8.711            3.246
       Max      4.030           15.994
       Correlation     0.362

Log(R/G) across Intensity Values
Intensity    Spots    % <0    % >0
__________________________________
( 0 ,  4)        4   100.0     0.0
( 4 ,  8)     1499    74.1    25.9
( 8 , 12)     9847    40.4    59.6
(12 , 16)     3083    17.3    82.7
__________________________________
*NB: Valid Spot defined as spots with Background < Foreground for both Red and Green channels and with a Quality Flag of 0.
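The part of the script that selects valid spots is not reproduced on this poster. A minimal sketch of the stated definition, assuming the same field positions that the image quality tests use (red foreground in $9, red background in $12, green foreground in $18, green background in $21, flag in the last field), could look like the following. The helper make_row and the toy records are hypothetical and exist only to make the example self-contained.

```shell
# Build one toy record of 22 whitespace-separated fields (a real row in this
# script has 43). Only the fields the script tests are set: red foreground ($9),
# red background ($12), green foreground ($18), green background ($21), flag ($NF).
make_row() {
  awk -v rf="$1" -v rb="$2" -v gf="$3" -v gb="$4" -v flag="$5" 'BEGIN {
    for (i = 1; i <= 22; i++) f[i] = 0
    f[9] = rf; f[12] = rb; f[18] = gf; f[21] = gb; f[22] = flag
    line = f[1]
    for (i = 2; i <= 22; i++) line = line " " f[i]
    print line
  }'
}

# Four toy spots: one valid, one with red background >= foreground,
# one flagged -50, and one with green background >= foreground.
toy=$(
  make_row 900 100 800 120 0
  make_row 90 100 800 120 0
  make_row 900 100 800 120 -50
  make_row 900 100 100 120 0
)

# Assumed filter: flag of 0 and background below foreground in both channels.
valid=$(printf '%s\n' "$toy" | awk '$NF == 0 && $12 < $9 && $21 < $18' | wc -l | tr -d ' ')
total=$(printf '%s\n' "$toy" | wc -l | tr -d ' ')
pct=$(awk -v v="$valid" -v t="$total" 'BEGIN { printf "%.1f", 100 * v / t }')
echo "Valid spots: $valid of $total ($pct%)"
```

On the toy records this keeps 1 spot of 4 (25.0%). The same arithmetic applied to the report above gives 14433 of 19200, or 75.2%.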

AWK Script: (cont.)

for minr in 0 0.2 0.4 0.6 0.8 0.85 0.9
do
  T1=`awk -v corr=$minr '$1>corr {print $0}' rr | wc | awk '{print $1}'`
  T2=`awk -v corr=$minr '$1>corr {print $0}' rl | wc | awk '{print $1}'`
  T3=`awk -v corr=$minr '$1>corr {print $0}' gr | wc | awk '{print $1}'`
  T4=`awk -v corr=$minr '$1>corr {print $0}' gl | wc | awk '{print $1}'`
  echo "> " $minr $T1 $T2 $T3 $T4
done
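The loop reports raw counts, while the abstract speaks of the percentage of data retained at each threshold. As a hedged sketch of how that percentage could be derived in a single awk pass per threshold, the example below uses five made-up ratios in the same one-value-per-line layout as the rr file; the variable names and the shortened threshold list are assumptions for illustration.

```shell
# Hypothetical mean-to-median ratios, one per spot (same layout as rr).
toy_ratios=$(printf '%s\n' 0.95 0.88 0.82 0.55 0.91)

# For each threshold, count the spots above it and express the count
# as a percentage of all spots read (NR).
retained=$(for minr in 0 0.8 0.9
do
  printf '%s\n' "$toy_ratios" |
    awk -v corr="$minr" '
      $1 > corr { n++ }
      END { printf "> %.2f %d %.1f\n", corr, n, 100 * n / NR }'
done)
echo "$retained"
```

Each output line holds the threshold, the number of spots left, and the percentage left; for the toy data the last line shows 2 spots (40.0%) above 0.90.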

echo " =-=-=-= Log(R/G) vs 0.5*Log(R*G) =-=-=-=-="

awk '{print $3, $4}' rgma | awk '{ v1[NR]=$1; v2[NR]=$2}; \
END{ min1=min2=99999; max1=max2=-99999; \
for(i=1;i<=NR;i++){ if( v1[i] < min1 ) min1 = v1[i]; \
if( v2[i] < min2 ) min2 = v2[i]; if( v1[i] > max1 ) max1 = v1[i]; \
if( v2[i] > max2 ) max2 = v2[i]; s1 += v1[i]; ss1 += v1[i]*v1[i]; \
s2 += v2[i]; ss2 += v2[i]*v2[i]; ss12 += v1[i]*v2[i] }; \
mean1 = s1/NR; mean2 = s2/NR; \
std1 = sqrt(( ss1 - (s1*s1)/NR ) / (NR-1)); \
std2 = sqrt(( ss2 - (s2*s2)/NR ) / (NR-1)); \
num = ( ss12 - (s1*s2)/NR ) / (NR-1); \
den = std1 * std2; corr = num / den; \
printf"%10s%11d%17d\n","N",NR,NR; \
printf"%10s%11.3f%17.3f\n","Mean",mean1,mean2; \
printf"%10s%11.3f%17.3f\n","Std",std1,std2; \
printf"%10s%11.3f%17.3f\n","Min",min1,min2; \
printf"%10s%11.3f%17.3f\n","Max",max1,max2; \
printf"%18s%10.3f\n","Correlation",corr}'
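The poster does not show how the rgma file is built or how the "Log(R/G) across Intensity Values" table is tabulated. The sketch below is one plausible way to obtain such a table, assuming an input of red and green intensities per spot (a hypothetical layout, with made-up values), base-2 logarithms as elsewhere in the script, and intervals of width 4 on the average intensity as in the report.

```shell
# Hypothetical red and green intensities, one spot per line (not real data).
bins=$(printf '%s\n' "16 64" "64 16" "1024 256" "4096 1024" "256 1024" "8 2" |
  awk '{
    m = log($1 / $2) / log(2)          # log2 ratio, Log(R/G)
    a = 0.5 * log($1 * $2) / log(2)    # average log2 intensity, 0.5*Log(R*G)
    b = int(a / 4)                     # interval index: (0,4), (4,8), ...
    n[b]++
    if (m < 0) neg[b]++
    if (m > 0) pos[b]++
  }
  END {
    for (b = 0; b < 4; b++)
      if (n[b] > 0)                    # skip empty intervals
        printf "(%2d , %2d) %5d %6.1f %6.1f\n",
               4 * b, 4 * b + 4, n[b], 100 * neg[b] / n[b], 100 * pos[b] / n[b]
  }')
echo "$bins"
```

Each line lists an intensity interval, the number of spots in it, and the percentages of spots with negative and positive log ratios. A strong drift of these percentages across intervals, as in the report above, suggests an intensity-dependent dye bias.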

echo " =-=-=-= EMPIRICAL DENSITIES =-=-=-="

awk '{print log($1)/log(2)}' rgma | sort -n | \
awk '{ data[NR] = $1 }; \
END { min = data[1]; max = data[NR]; range = max - min; \
n_int = 1000; if( int(NR*.1) <= n_int ) n_int = int(NR*.1); \
size = range / n_int; \
for(i=1; i<=NR; i++){ tot += data[i]; \
aux = int((data[i] - min)/size) + 1; \
q[aux]++; \
}; \
mn_int = min + size/2; \
for(i=1; i<=n_int; i++){if( q[i] < 1 ) q[i] = 0; \
print mn_int, q[i]; \
mn_int += size } \
}' > logr.d
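The density step sorts the values, splits their range into n_int equal intervals, and prints each interval midpoint with its count. With this binning rule the largest value appears to fall into interval n_int + 1, which the output loop never prints. The sketch below reproduces the rule on nine made-up values with a fixed n_int of 4 and adds a clamp for that case; both the clamp and the fixed interval count are assumptions, not part of the original script.

```shell
# Nine hypothetical values, already on the scale to be binned.
density=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 |
  sort -n |
  awk -v n_int=4 '
    { data[NR] = $1 }
    END {
      min = data[1]; max = data[NR]
      size = (max - min) / n_int           # interval width
      for (i = 1; i <= NR; i++) {
        bin = int((data[i] - min) / size) + 1
        if (bin > n_int) bin = n_int       # keep the maximum in the last interval
        q[bin]++
      }
      mid = min + size / 2                 # midpoint of the first interval
      for (i = 1; i <= n_int; i++) {
        printf "%.2f %d\n", mid, q[i]
        mid += size
      }
    }')
echo "$density"
```

With the clamp, the counts add up to the number of input values (2 + 2 + 2 + 3 = 9 here), which is a quick sanity check before plotting the file with GNUPLOT.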
