KIN 605: Reliability and Validity of Measurement

Measurement: All measures consist of a true score plus error.

·        Accuracy: Use methods that are valid and reliable. Use pilot tests to evaluate the validity and reliability of new instruments. Go to original sources.

o       If a measurement is accurate, then it is both valid and reliable, i.e., it is “consistently on target.”

·        Systematic Error or Bias: Due to something in the environment that alters performance in a given or systematic direction, e.g., elevated temperature will lead to increased heart rate (HR).

·        Random Error or “Noise”: Chance variation, e.g., the mood of the participants; some will be good and some will be bad.

o       The effect on the group mean will NOT tend to be in one direction and therefore cancels out.

·        Minimize Error: Conduct pilot tests (ask for feedback about environment and difficulty; evaluate the accuracy of equipment). Train testers/technicians. Double-check data at initial recording, when it is entered into the computer, and at every other opportunity!

Measurement & Affective Behavior: (p. 194)

·        Likert Scale: A type of closed question that requires the subject to respond by choosing one of several scaled responses; the intervals between items are assumed to be equal.

·        Semantic Differential Scale: Used to measure affective behavior; the respondent makes judgments about certain concepts by choosing one of seven intervals between bipolar adjectives, e.g., rating a coach from “creative” to “unoriginal” on a 7-point scale.

Measurement Error: (p. 185) Measurement error results from four sources.

1.      Participant: (mood, motivation, fatigue, health, fluctuations in memory and performance, previous practice, specific knowledge, familiarity with test items)

2.      Testing:  (lack of clarity or completeness in directions, how rigidly directions are followed, whether supplementary directions or motivation is applied)

3.      Scoring: (competence, experience, and dedication of the scorers, and the nature of the scoring itself)

4.      Instrumentation: (inaccuracy and lack of calibration of mechanical and electronic equipment, inadequacy of a test to discriminate between abilities, and the difficulty of scoring some tests)

Measurement Error & Rating: (p. 195)

·        Central Tendency Errors: Inclination of the rater to give an inordinate number of ratings in the middle of the scale, thus avoiding the extremes of the scale.

·        Halo Effect: Threat to internal validity where raters allow previous impressions or knowledge about a certain individual to influence all ratings of that individual’s behaviors. 

·        Leniency: Tendency for observers to be overly generous in rating.

·        Proximity Error: Inclination of rater to consider behaviors to be more nearly the same when they are listed close together on a scale than when they are separated by some distance i.e. the different phases of behavior are rated the same. 

·        Observer Bias Error: Inclination of a rater to be influenced by his or her own characteristics and prejudices. 

·        Observer Expectation Error: Inclination of a rater to see evidence of certain expected behaviors and to interpret observations in the expected direction.

Measurement & Standard Error: (p. 190-192) Every test yields only “observed” scores.  We can obtain only estimates of a person’s “true” score.  It is much better to think of test scores as falling within a range that contains the true score.
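These notes do not give a formula for that range, but a standard way to express it (assuming s is the test's standard deviation and r its reliability coefficient, as defined elsewhere in this section) is the standard error of measurement:

```latex
% Standard error of measurement: the typical size of the error score,
% used to put a confidence band around an observed score.
\[
  \mathrm{SEM} = s\sqrt{1 - r},
  \qquad
  \text{true score} \approx X_{\text{observed}} \pm 1.96\,\mathrm{SEM}
  \quad \text{(95\% of the time)}
\]
```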

·        But how do we compare a score on one test to a score on a different test? The scores must be converted into “standard scores” expressed in terms of standard deviations from the mean. You can determine standard scores by using Z scores or T scales.

·        Z Score: The basic standard score; converts raw scores into units of standard deviation, where the mean is 0 and the SD is 1.0.

·        T Scale: A type of standard score that sets the mean at 50 and the SD at 10 to remove the decimals found in Z scores and to make all scores positive, e.g., Z = 1.0 is a T of 60 and Z = -1.0 is a T of 40.

o       Because 99.73% of scores fall within ±3 SD, it is rare to have T scores below 20 (Z = -3.0) or above 80 (Z = +3.0).
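A minimal sketch of these conversions; the raw scores are hypothetical and any list of scores works the same way:

```python
# Convert raw scores to Z scores (mean 0, SD 1) and T scores (mean 50, SD 10).
from statistics import mean, stdev

raw = [52, 61, 48, 70, 55, 64]  # hypothetical raw test scores

m, s = mean(raw), stdev(raw)                 # sample mean and SD
z_scores = [(x - m) / s for x in raw]        # Z: deviations from the mean in SD units
t_scores = [50 + 10 * z for z in z_scores]   # T: removes decimals and negatives

for x, z, t in zip(raw, z_scores, t_scores):
    print(f"raw={x}  Z={z:+.2f}  T={t:.1f}")
```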

 

Reliability: An integral part of validity; pertains to the consistency or repeatability of a measure, i.e., the consistency of the data when measurements are taken more than once under the same conditions.

  • The study is repeatable, i.e., it yields consistent values when measured over and over. A test cannot be considered “valid” if it is not reliable: if a test is not consistent, you cannot depend on successive trials to yield the same results, and the test simply cannot be trusted.
  • Can check reliability with a pilot study.
  • Validity and Reliability must be specific to your population of interest i.e. just because it works with adults doesn’t mean it will work with children.

Reliability Expression:  Expressed by a correlation coefficient ranging from 0.00 to 1.00. The closer to 1.00, the less error variance it reflects and the more the true score is assessed.  Techniques for computing the reliability coefficient are:

1.      Interclass Correlation (Pearson r): This coefficient is a “bivariate” statistic, meaning it is used to correlate two different variables. It is the most commonly used method of computing the correlation between two variables.

·        Computations of Pearson r are limited to only two sets of scores, X and Y (see the first sketch after this list).

2.      Intraclass Correlation: Uses ANOVA to obtain the reliability coefficient.

·        Test performance can be examined from trial to trial, and then the most reliable testing schedule can be selected, i.e., the last trials may differ significantly from the first because of a learning curve or fatigue effect.

·        ANOVA yields an F score that indicates significance (see the second sketch below).
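A minimal sketch of the interclass (Pearson r) computation, using hypothetical scores for two trials of the same test; the data and variable names are illustrative only:

```python
# Interclass reliability: Pearson r between two trials of the same test.
from statistics import mean, stdev

trial1 = [12.1, 11.8, 13.0, 12.6, 11.5, 12.9, 13.4, 12.2]  # hypothetical
trial2 = [12.3, 11.6, 13.2, 12.4, 11.7, 13.0, 13.1, 12.5]

m1, m2 = mean(trial1), mean(trial2)
n = len(trial1)

# Pearson r = covariance(X, Y) / (SD_X * SD_Y)
cov = sum((x - m1) * (y - m2) for x, y in zip(trial1, trial2)) / (n - 1)
r = cov / (stdev(trial1) * stdev(trial2))
print(f"Pearson r = {r:.3f}")  # closer to 1.00 -> less error variance
```

And a minimal sketch of the intraclass approach: a one-way ANOVA intraclass correlation, here ICC(1,1), computed across several trials, together with the F score the ANOVA yields. Again, the data are hypothetical:

```python
# Intraclass reliability via one-way ANOVA: rows = participants, cols = trials.
import numpy as np

scores = np.array([
    [12.1, 12.3, 12.2],
    [11.8, 11.6, 11.9],
    [13.0, 13.2, 13.1],
    [12.6, 12.4, 12.5],
    [11.5, 11.7, 11.6],
])
n, k = scores.shape
grand = scores.mean()
subj_means = scores.mean(axis=1)

# Between-subjects and within-subjects mean squares from the ANOVA table
ms_between = k * ((subj_means - grand) ** 2).sum() / (n - 1)
ms_within = ((scores - subj_means[:, None]) ** 2).sum() / (n * (k - 1))

icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
f_stat = ms_between / ms_within  # the F score that indicates significance
print(f"ICC(1,1) = {icc:.3f}, F = {f_stat:.1f}")
```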

Reliability Scores: Test reliability is sometimes discussed in terms of scores.

·        Observed Score: Obtained score that comprises a person’s true score and error score.

·        True Score:  Part of observed score that represents the individual’s real score and does not contain measurement error.

·        Error Score:  Part of an observed score that is attributed to measurement error (from participant, testing, scoring, and instrumentation). 

Reliability Types:

·        Inter-Rater or Inter-Observer Reliability:  Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon or on the same participants e.g., if more than one technician is used, they should score things in the same manner.

o       Raters need specific instructions and practice to get their scores close.

·        Test-Retest Reliability:  Used to assess the consistency of a measure from one time or trial to another e.g., if tested for multiple trials or across multiple days, the scores should be the same.

o       You might have to toss the first few tests before the results “level out,” i.e., you need to see the scores “flat-line.” The first few trials/days reflect a “learning curve,” so you must build this into your model or otherwise allow for it.

·        Parallel-Forms Reliability:  Used to assess consistency of results of two tests constructed in the same way from the same content domain e.g. comparing two tests of anaerobic power. 

·        Internal Consistency Reliability:  Used to assess the consistency of results across items within a test e.g. two questions that assess the same concept should elicit the same result. 

Reliability Analyzing:

  • Consistency: (different from validity)
    • Correlation & SEE
    • Same-day test-retest (usually physical performance)
    • Split-Half Technique (see the sketch after this list)
  • Stability:
    • Repeated Measures ANOVA
    • Test-Retest on separate days
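A minimal sketch of the split-half technique, with the Spearman-Brown step-up correction for full-test length; the item scores below are hypothetical (rows = participants, columns = test items):

```python
# Split-half reliability: correlate odd-item vs. even-item half-test totals,
# then correct the half-test r up to the full test length.
from statistics import mean, stdev

items = [
    [4, 5, 4, 3, 5, 4, 4, 3],
    [2, 3, 2, 2, 3, 2, 3, 2],
    [5, 5, 4, 5, 4, 5, 5, 4],
    [3, 2, 3, 3, 2, 3, 2, 3],
    [4, 4, 5, 4, 5, 4, 4, 5],
]

odd_half  = [sum(row[0::2]) for row in items]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in items]   # items 2, 4, 6, 8

def pearson(xs, ys):
    mx, my, n = mean(xs), mean(ys), len(xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (stdev(xs) * stdev(ys))

r_half = pearson(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown correction
print(f"half-test r = {r_half:.3f}, corrected full-test r = {r_full:.3f}")
```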

 

Validity: The soundness or correctness of a test or instrument in measuring what it is designed to measure i.e. the truthfulness of the test or instrument.

  • Means you are measuring what you think you are measuring.
  • Validity and Reliability must be specific to your population of interest i.e. just because it works with adults doesn’t mean it will work with children.

Validity Analyzing:

·        Root Mean Square Error (RMSE): Amount of error around the line of identity (x = criterion method vs. y = alternative method).

o       The best choice because the comparison is made against “the true” (criterion) value.

·        Standard Error of the Estimate (SEE): Amount of error around the regression line (assumes a significant correlation).

·        Bland-Altman Technique: The number of cases that fall within the 95% CI of the true value.
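A minimal sketch of all three checks, comparing an alternative measure (y) against a criterion “gold standard” (x); the %fat numbers are made up, and with real data x might be underwater weighing and y bioelectrical impedance:

```python
# Three ways to analyze validity against a criterion method.
import numpy as np

criterion   = np.array([15.2, 18.1, 22.4, 12.8, 25.0, 19.6, 16.3, 21.1])
alternative = np.array([14.8, 18.9, 21.6, 13.5, 24.1, 20.3, 15.9, 22.0])

# 1) Root mean square error: error around the line of identity (y = x)
rmse = np.sqrt(np.mean((alternative - criterion) ** 2))

# 2) Standard error of the estimate: error around the regression line
slope, intercept = np.polyfit(criterion, alternative, 1)
residuals = alternative - (slope * criterion + intercept)
see = np.sqrt(np.sum(residuals ** 2) / (len(criterion) - 2))

# 3) Bland-Altman: mean difference (bias), 95% limits of agreement, and
#    the proportion of cases falling within those limits
diffs = alternative - criterion
bias = diffs.mean()
loa = 1.96 * diffs.std(ddof=1)
within = np.mean(np.abs(diffs - bias) <= loa)

print(f"RMSE={rmse:.2f}  SEE={see:.2f}  bias={bias:.2f} ± {loa:.2f}  "
      f"within limits: {within:.0%}")
```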

Validity Types or Categories: 

·        Construct Validity: Degree to which a measure reflects the associated characteristic or to which a test measures a hypothetical construct; usually established by relating the test results to some behavior e.g. someone who scores high on a test for “cooperation” acts cooperatively in a “real-life” setting. 

o       “Usually” a psych-type test but “can be” applied to a physical test. Does it really reflect the personality it purports to measure in “real life”?

o       Can be tested by “known group difference method” e.g. a skill critical to basketball performance can be performed better by successful basketball players than by downhill skiers. 

o       All other forms of validity are used for evidence of construct-related validity.  It is usually necessary to use evidence from all the other forms to provide strong support for the validity of a particular instrument and the use of its results. 

·        Content Validity: The measurement instrument reflects the training, e.g., you can’t use an isokinetic machine to test subjects you trained on free weights.

o       Usually applies in educational settings, e.g., did the test adequately sample what was covered in the course? Is there a corresponding number of questions in each area?

·        Criterion Validity:  Degree to which scores on a test are related to some recognized standard or criterion. 

o       Concurrent: The “gold standard” method and the alternative method, used simultaneously or at nearly the same time, should yield the same results, e.g., underwater weighing should provide about the same estimate of body fat as bioelectrical impedance.

Ø      Type of criterion validity that involves correlating an instrument with some criterion that is administered at about the same time i.e. “concurrently.”

Ø      Usually employed when the researcher wishes to substitute a shorter, more easily administered test for a criterion that is more difficult to measure.  

o       Predictive: Measure can accurately predict some future outcome e.g. GRE scores predict success in graduate school. 

Ø      Degree to which scores of predictor variables can accurately predict criterion scores. 

Ø      Need to determine a “base rate” before you can predict. 

Ø      May have little value if the base rate is very low or very high.

Ø      Multiple regression is used because several predictors yield a greater validity coefficient.

Ø      Shrinkage occurs when validity decreases after the prediction formula is used with a “new” sample; cross-validation must then be used to minimize shrinkage (see the sketch after this list).

·        Logical/Face Validity: The test appears to measure what it intends to measure; it looks like it is testing what it is supposed to, so it probably is.

o       What it is measuring is obvious, e.g., measuring blood pressure (BP) with a thermometer lacks face validity.

o       Degree to which measure obviously involves the performance being measured. 
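A minimal sketch of shrinkage and cross-validation; all data here are simulated (in practice the predictors might be GRE scores and undergraduate GPA, and the criterion graduate-school success):

```python
# Fit a multiple-regression prediction formula on one sample, then check
# its validity coefficient on a "new" sample; the drop is the shrinkage.
import numpy as np

rng = np.random.default_rng(0)

def make_sample(n):
    X = rng.normal(size=(n, 2))  # two predictor variables
    y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=1.0, size=n)
    return X, y

X_train, y_train = make_sample(30)   # original sample
X_new,   y_new   = make_sample(30)   # cross-validation sample

# Least-squares prediction formula with an intercept column
A = np.column_stack([np.ones(len(y_train)), X_train])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def validity_r(X, y):
    """Correlation between predicted and actual criterion scores."""
    pred = np.column_stack([np.ones(len(y)), X]) @ coef
    return np.corrcoef(pred, y)[0, 1]

# The coefficient is typically lower on the new sample (shrinkage).
print(f"validity on original sample: r = {validity_r(X_train, y_train):.2f}")
print(f"validity on new sample:      r = {validity_r(X_new, y_new):.2f}")
```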


Ron Jones/www.ronjones.org (11-18-01)

 

 
