Class-presentation

CHAPTER 8

Managing and Curating Data

The Second Step

Storing and Curating Data

Storage: Temporary and Archival

Permanent archives
• The only medium acceptable as truly archival is acid-free paper
Electronic storage
• Do not expect electronic media to last more than 5-10 years
• Should be used primarily for working copies
• If used, copy datasets onto newer electronic media on a regular basis

Curating Data

• Most ecological and environmental data are collected by researchers using funds obtained through grants and contracts
• They are technically owned by the granting agency, and they need to be made widely available (e.g., on the Internet)
• Unfortunately, when budgets are cut, data management and curation costs are often the first items to be dropped

The Final Step

Transforming the Data

Transformation

• A mathematical function that is applied to all of the observations of a given variable: $Y^* = f(Y)$
• Most are fairly simple algebraic functions, as long as they are continuous monotonic functions
• Transformations DO NOT change the rank order of the data
• Transformations DO change the relative spacing of the data
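The two properties above can be checked on a handful of numbers. A minimal sketch in Python, assuming a base-10 logarithm as the function f and four made-up observations (the values and the helper name `ranks` are illustrative, not from the slides):

```python
import math

def ranks(values):
    """Return the rank (0 = smallest) of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

# Hypothetical observations, chosen only for illustration.
y = [1.0, 10.0, 100.0, 1000.0]
y_star = [math.log10(v) for v in y]  # Y* = f(Y) with f = log10

# The rank order is the same before and after the transformation ...
same_ranks = ranks(y) == ranks(y_star)

# ... but the gaps between neighbouring values are not.
gaps_raw = [b - a for a, b in zip(y, y[1:])]
gaps_star = [b - a for a, b in zip(y_star, y_star[1:])]

print(same_ranks, gaps_raw, gaps_star)
```

The logarithm is an increasing function, so the order is kept as is. A decreasing monotonic function such as 1/Y would reverse the order rather than scramble it.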

Why Transform Data?

(1) Patterns in the transformed data may be easier to understand and communicate than patterns in the raw data (e.g., converting curves into straight lines)
(2) Transformation may be necessary for the analysis to be valid ("meeting the assumptions")

The Species-Area Relationship

A classic example
If we plot the number of species against the area of the island, the data often follow a simple power function, $S = cA^z$, where
S = number of species
A = island area
c and z = constants fitted to the data

The Species-Area Relationship

A classic example

Island         Area (km²)  No. of species  Log10(Area)  Log10(Species)
Albemarle          5824.9             325        3.765           2.512
Charles             165.8             319        2.220           2.504
Chatham             505.1             306        2.703           2.486
James               525.8             224        2.721           2.350
Indefatigable      1007.5             193        3.003           2.286
Abingdon             51.8             119        1.714           2.076
Duncan               18.4             103        1.265           2.013
Narborough          634.6              80        2.803           1.903
Hood                 46.6              79        1.668           1.898
Seymour               2.6              52        0.415           1.716
Barrington           19.4              48        1.288           1.681
Gardner               0.5              48       -0.301           1.681
Bindloe             116.6              47        2.067           1.672
Jervis                4.8              42        0.681           1.623
Tower                11.4              22        1.057           1.342
Wenman               47                14        1.672           1.146
Culpepper             2.3               7        0.362           0.845

The Species-Area Relationship

[Figure: plot of the species-area data; x-axis: Island Area (km²), 0 to 7000; y-axis: 0 to 400]

The Species-Area Relationship

If species richness and island area are related by a power function, we can transform this equation by taking logarithms of both sides:
$\log(S) = \log(cA^z)$
$\log(S) = \log(c) + z\log(A)$
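The log-log form is a straight line with slope z and intercept log(c), so both constants can be estimated by ordinary least squares. A minimal sketch in Python, assuming the area and species values pair up island by island as in the table in this section (the pairing is inferred from the log columns, and the helper name `fit_line` is hypothetical):

```python
import math

# (area in km^2, number of species) for the 17 islands in the table.
# Assumption: values are paired in table order, as implied by the log columns.
islands = [
    (5824.9, 325), (165.8, 319), (505.1, 306), (525.8, 224),
    (1007.5, 193), (51.8, 119), (18.4, 103), (634.6, 80),
    (46.6, 79), (2.6, 52), (19.4, 48), (0.5, 48), (116.6, 47),
    (4.8, 42), (11.4, 22), (47.0, 14), (2.3, 7),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

log_area = [math.log10(a) for a, _ in islands]
log_species = [math.log10(s) for _, s in islands]

# log(S) = log(c) + z * log(A): the slope is z, the intercept is log(c).
z, log_c = fit_line(log_area, log_species)
c = 10 ** log_c  # back-transform the intercept to the constant c

print(f"z = {z:.3f}, log10(c) = {log_c:.3f}, c = {c:.1f}")
```

Fitting on the log scale is a design choice that matches the slides; fitting the power function directly to the raw data would weight the large islands more heavily and generally gives somewhat different constants.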

The Species-Area Relationship

[Figure: plot of the log-transformed species-area data; x-axis: log10(Island Area), -1 to 4; y-axis: 0.6 to 2.6]

Other Transformations

Cube-root transformation
• Used for measures of mass or volume ($Y^3$) that are allometrically related to linear measures of body size or length
Logarithmic transformation
• Used to examine relationships between two measures of mass or volume ($Y^3$); both X and Y are transformed

Why Transform Data?

Statistics demands it
• All statistical tests require data to fit certain mathematical assumptions
Examples

Analysis of Variance

(1) variances must be homoscedastic (equal among groups)
(2) residuals must be normal random variables

Regression

(1) normally-distributed residuals that are uncorrelated with the independent variable

Five Common Transformations

(1) Logarithmic Transformation
(2) Square-root Transformation
(3) Angular (or arcsine) Transformation
(4) Reciprocal Transformation
(5) Box-Cox Transformation

Logarithmic Transformation

• Replaces each observation with its logarithm: $Y^* = \log(Y)$
• Often equalizes variances for data in which the mean and variance are positively correlated; such data also tend to have outliers and positively skewed residuals
• The logarithm of 0 is not defined; if the data include zeros, add 1 to each observation: $Y^* = \log(Y + 1)$
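The zero-handling rule is easy to get wrong in practice, so here is a minimal sketch in Python. The function name `log_transform`, the `offset` parameter, and the counts are all hypothetical, and base 10 is assumed only because the species-area example uses it:

```python
import math

def log_transform(values, offset=0.0):
    """Return log10(Y + offset) for each observation."""
    shifted = [v + offset for v in values]
    if min(shifted) <= 0:
        raise ValueError("log is undefined for values <= 0; use a positive offset")
    return [math.log10(v) for v in shifted]

# Hypothetical counts that include a zero.
counts = [0, 9, 99, 999]

# Y* = log10(Y + 1): the zero maps to log10(1) = 0 instead of failing.
transformed = log_transform(counts, offset=1.0)
print(transformed)
```

Calling `log_transform(counts)` without the offset raises a `ValueError`, which is safer than silently producing an undefined value.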

Square-root Transformation

• Replaces each observation with its square root: $Y^* = \sqrt{Y}$
• Used most frequently for count data, which often follow a Poisson distribution
• Yields a variance that is independent of the mean
• Leaves data values equal to 0 unchanged, so some small number is added to the observations before transforming
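The claim that the square root removes the link between variance and mean can be illustrated by simulation. A minimal sketch in Python, assuming Poisson counts drawn with a simple textbook sampler (the helper `poisson_sample`, the seed, the sample size, and the three means are choices of this sketch, not of the slides):

```python
import math
import random
import statistics

def poisson_sample(mean, rng):
    """Draw one Poisson variate by multiplying uniforms until below exp(-mean)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(1)  # fixed seed so the sketch is repeatable
results = {}
for mean in (10, 40, 80):
    raw = [poisson_sample(mean, rng) for _ in range(5000)]
    rooted = [math.sqrt(v) for v in raw]  # Y* = sqrt(Y)
    results[mean] = (statistics.variance(raw), statistics.variance(rooted))
    print(mean, round(results[mean][0], 2), round(results[mean][1], 3))
```

The variance of the raw counts tracks the mean, as expected for Poisson data, while the variance of the square roots stays close to 1/4 at every mean.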

Arcsine Transformation

• Also called the arcsine-square root or angular transformation
• Replaces each observation with the arcsine of the square root of the value: $Y^* = \arcsin(\sqrt{Y})$
• Principally used for proportions
• Removes the dependence of the variance on the mean
• Gives transformed data in units of radians, not degrees
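The point about units is worth seeing with numbers. A minimal sketch in Python, assuming proportions expressed on the 0 to 1 scale (the function name `arcsine_sqrt` and the example proportions are hypothetical):

```python
import math

def arcsine_sqrt(p):
    """Return arcsin(sqrt(p)) in radians for a proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportions must lie between 0 and 1")
    return math.asin(math.sqrt(p))

proportions = [0.0, 0.25, 0.5, 1.0]
in_radians = [arcsine_sqrt(p) for p in proportions]
in_degrees = [math.degrees(r) for r in in_radians]  # convert only for display

print(in_radians)
print(in_degrees)
```

The transformed values run from 0 to pi/2 radians (0 to 90 degrees), so percentages must be divided by 100 before transforming.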

Reciprocal Transformation

• Replaces each value with its reciprocal: $Y^* = 1/Y$
• Commonly used for data that record rates, which often follow hyperbolic curves
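To see why the reciprocal suits hyperbolic data, consider a curve of the form Y = 1/(a + bX). A minimal sketch in Python, where the functional form and the constants a and b are invented for illustration and are not from the slides:

```python
# Hypothetical hyperbolic relationship Y = 1 / (a + b*X).
a, b = 0.5, 0.25
x = [1, 2, 3, 4, 5, 6]
y = [1.0 / (a + b * xi) for xi in x]

# Y* = 1/Y turns the hyperbola into the straight line a + b*X.
y_star = [1.0 / yi for yi in y]

# Equal steps in X give unequal steps in Y but equal steps (b) in Y*.
steps_raw = [q - p for p, q in zip(y, y[1:])]
steps_star = [q - p for p, q in zip(y_star, y_star[1:])]

print([round(s, 4) for s in steps_raw])
print([round(s, 4) for s in steps_star])
```

Unlike the logarithm and the square root, the reciprocal is a decreasing function for positive values, so the largest observation becomes the smallest after transformation.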

Box-Cox Transformation

A family of transformations:
$Y^* = (Y^\lambda - 1)/\lambda$ (for $\lambda \neq 0$)
$Y^* = \log_e(Y)$ (for $\lambda = 0$)
$L = -\frac{v}{2}\log_e(s_T^2) + (\lambda - 1)\frac{v}{n}\sum \log_e(Y)$
where
v = degrees of freedom
n = sample size
$s_T^2$ = variance of the transformed values of Y

Box-Cox Transformation

• The value of $\lambda$ that maximizes the log-likelihood function $L$ is used in the transformation equation for $Y^*$ (the $\lambda \neq 0$ or the $\lambda = 0$ form, as appropriate) to provide the closest fit of the transformed data to a normal distribution
• The equation for $L$ must be solved iteratively (trying different values of $\lambda$ until $L$ is maximized) using computer software
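The iterative search can be sketched as a simple grid search over candidate values of lambda. This is a minimal illustration in Python, not a production routine: it assumes v = n - 1 degrees of freedom, strictly positive data, a search range of -2 to 2 in steps of 0.01, and simulated lognormal data. The function names are hypothetical.

```python
import math
import random
import statistics

def box_cox(values, lam):
    """Y* = (Y**lam - 1) / lam for lam != 0, and ln(Y) for lam == 0."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def log_likelihood(values, lam):
    """L = -(v/2) ln(s_T^2) + (lam - 1)(v/n) sum(ln Y), assuming v = n - 1."""
    n = len(values)
    v = n - 1
    s2_t = statistics.variance(box_cox(values, lam))  # variance of transformed Y
    sum_log = sum(math.log(y) for y in values)
    return -(v / 2.0) * math.log(s2_t) + (lam - 1.0) * (v / n) * sum_log

def best_lambda(values, low=-2.0, high=2.0, step=0.01):
    """Try every lambda on the grid and keep the one that maximizes L."""
    n_steps = int(round((high - low) / step))
    candidates = [round(low + i * step, 10) for i in range(n_steps + 1)]
    return max(candidates, key=lambda lam: log_likelihood(values, lam))

# Simulated lognormal data: the log transformation (lambda = 0) should fit well.
rng = random.Random(0)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]
lam_hat = best_lambda(sample)
print("lambda maximizing L:", lam_hat)
```

For lognormal data the search should settle near lambda = 0, which corresponds to the natural logarithm. Statistical packages typically replace the grid with a numerical optimizer, but the quantity being maximized is the same.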

Box-Cox Transformation

• When $\lambda = 1$, the $\lambda \neq 0$ equation results in a linear transformation
• When $\lambda = 1/2$, a square-root transformation
• When $\lambda = -1$, a reciprocal transformation
• When $\lambda = 0$, the second equation results in a natural logarithmic transformation
• ALWAYS try using simple arithmetic transformations FIRST
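Substituting each value of lambda into the Box-Cox equation makes these special cases explicit. Each result differs from the familiar transformation only by a shift and a scale factor, and the last line shows why the logarithm is the natural choice at lambda = 0. A short worked derivation:

```latex
\begin{align*}
\lambda = 1:&\quad Y^* = \frac{Y^{1} - 1}{1} = Y - 1 \\
\lambda = \tfrac{1}{2}:&\quad Y^* = \frac{Y^{1/2} - 1}{1/2} = 2\left(\sqrt{Y} - 1\right) \\
\lambda = -1:&\quad Y^* = \frac{Y^{-1} - 1}{-1} = 1 - \frac{1}{Y} \\
\lambda \to 0:&\quad Y^* = \lim_{\lambda \to 0} \frac{Y^{\lambda} - 1}{\lambda}
  = \lim_{\lambda \to 0} \frac{e^{\lambda \log_e Y} - 1}{\lambda} = \log_e Y
\end{align*}
```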

Box-Cox Transformation

• ALWAYS try using simple arithmetic transformations FIRST
• If the data are right-skewed, try familiar transformations from the series $1/\sqrt{Y}$, $\sqrt{Y}$, $\ln(Y)$, $1/Y$
• If the data are left-skewed, try $Y^2$, $Y^3$, etc.

[Figure: comparison of the original data with the logarithmic, square-root, arcsine, and reciprocal transformations; both axes run from 0 to 1]

Reporting Results

• You should report results in the original units, which means back-transforming the transformed values
• The back-transformed mean can be very different from the arithmetic mean of the original data
• Back-transformation will also normally result in asymmetrical confidence intervals
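Both effects show up in a small worked example. A minimal sketch in Python, assuming a natural-log transformation, eight made-up positively skewed observations, and a 95% interval based on the t distribution with 7 degrees of freedom (critical value 2.365):

```python
import math
import statistics

# Hypothetical positively skewed observations.
y = [2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0, 55.0]
logs = [math.log(v) for v in y]  # Y* = ln(Y)

mean_log = statistics.mean(logs)
se_log = statistics.stdev(logs) / math.sqrt(len(logs))

T_CRIT = 2.365  # two-sided 95% t quantile for n - 1 = 7 degrees of freedom
lower_log = mean_log - T_CRIT * se_log
upper_log = mean_log + T_CRIT * se_log

# Back-transform the mean and the interval limits to the original units.
back_mean = math.exp(mean_log)
lower, upper = math.exp(lower_log), math.exp(upper_log)
arithmetic_mean = statistics.mean(y)

print(f"arithmetic mean = {arithmetic_mean:.2f}")
print(f"back-transformed mean = {back_mean:.2f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is symmetric on the log scale but not in the original units: the upper limit lies farther from the back-transformed mean than the lower limit does. For a log transformation the back-transformed mean is the geometric mean, which is never larger than the arithmetic mean.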

Back-Transformations

• Logarithmic: antilog($Y^*$), i.e. $10^{Y^*}$ for base-10 logarithms or $e^{Y^*}$ for natural logarithms
• Square root: $(Y^*)^2$
• Arcsine: $\sin^2(Y^*)$
• Reciprocal: $1/Y^*$

Reporting Results

• Lastly, transforming data should be added to your audit trail (documented in the metadata)
• Create a new spreadsheet and store it on permanent media