The International Journal of Psychosocial Rehabilitation

Moses N. Ikiugu, PhD, OTR/L
Associate Professor and Director of Research

Assistant Professor and Clinical Fieldwork Coordinator

Occupational Therapy Department

This paper was presented at the 7th Annual Research Conference of the Society for the Study of Occupations (SSO).
Abstract

The purpose of this study was to estimate the mean validity, and the generalizability from research to clinical settings, of occupational performance measurement scores through a meta-analysis of findings from 19 studies. We used the validity generalization (VG) method developed by Hunter and Schmidt (2004). Our analysis indicated that the mean weighted validity coefficients were small according to the interpretation guidelines outlined by Cohen (1988). Scores based on self-report assessments had the highest mean validity. When the variance of the coefficients was corrected for attenuation by sampling error and variability of the test criterion measurement reliability, less than 75% (the decision rule suggested by Hunter and Schmidt) was explained. This suggested that the validity of the instruments investigated in the 19 studies could not be transferred from research to clinical settings without further validation. Further meta-analysis is indicated before more definite conclusions can be reached.
Introduction

In recent times, evidence-based practice has been advocated as a way of providing effective and efficient occupational therapy services (American Occupational Therapy Foundation [AOTF], 2007; Canadian Association of Occupational Therapists [CAOT], 2007; Coster, Gillette, Law et al., 2004; Law, Baum, & Dunn, 2005). The most preferred type of evidence in medical disciplines is that derived from a meta-analysis of randomized controlled trials (Coster et al.; Depoy & Gitlin, 2005; Trombly & Ma, 2002). This is because a synthesis of research decreases the demand on therapists to retrieve and evaluate individual studies (Bennet & Townsend, 2006).

Polatajko (2006) suggested that there was not enough research evidence available for synthesis in meta-analysis to support occupational therapy interventions. She argued that, for that reason, there were not enough data to support evidence-based practice in occupational therapy. This could be due to the profession's limited research history and infrastructure (Ilott, 2004). Before enough evidence can be accumulated for summary and synthesis in order to support practice, data have to be adduced using valid and reliable research instruments. Therefore, the first concern should be whether we have valid instruments to produce the needed evidence.
In this paper, it is argued that evidence relevant to occupational therapy must be related to occupational performance, as defined by occupational therapists and occupational scientists. Therefore, the issue at hand is whether or not we have valid instruments to measure occupational performance. Having such instruments would provide a starting point for accumulation of the appropriate type of evidence to support evidence-based occupational therapy practice.
The purpose of this study was to investigate: (1) the robustness of mean validity coefficients of various types of occupational performance measurement scores after correction for sampling error and variability of the test criterion measurement reliability; and (2) whether the validity of the occupational performance scores could be generalized from research to clinical settings. The following specific questions guided this inquiry: (1) What were the mean weighted validity coefficients of the various types of occupational performance measurement scores after correction for sampling error and variability in test criterion reliabilities (by test criterion we mean the occupational performance indicators that were observed during measurement)? (2) Could the validity coefficients of occupational performance scores be generalized from research to clinical settings without need for further validation? In other words, if a therapist chose an occupational performance measurement instrument with documented validity, could it be assumed that such validity would hold in the clinical setting where he/she intended to use the instrument, without need for further validation research? Answering the above questions would give the therapist confidence in defending his/her choice of occupational performance assessment instruments to clients, payers for occupational therapy services, the public, and other stakeholders.
Research Methods

The validity generalization (VG) method was developed by Schmidt and Hunter (1977) and recently updated by Hunter and Schmidt (2004). It is a type of meta-analysis aimed at determining the overall validity of measurement scores for a given phenomenon, as well as whether such validity can be generalized from the research setting to other situations. This type of study tests the situation-specificity hypothesis (Callender & Osburn, 1980; Hunter & Schmidt, 2004; Pearlman, Schmidt, & Hunter, 1980; Schmidt & Raju, 2007; Schmidt, Law, Hunter et al., 1993; Schmidt & Hunter, 1977). The hypothesis, which was originally proposed by Schmidt and Hunter, was restated for the purpose of this study as follows: Occupational performance scores observed in assessments vary from situation to situation due to local modifiers [also known as "random effects" (Brannick & Hall, 2001, p. 1)]. Therefore, they cannot be generalized to situations other than where the validation study was conducted without requiring further research.
According to Schmidt and Hunter (1977), the situation-specificity hypothesis was in the past considered self-evident in personnel psychology (the field for which VG procedures were originally developed). It originated from the empirical observation that "considerable variability is observed from study to study in raw validity coefficients even when jobs (types of occupations in our study) and tests studied appear to be similar or essentially identical…" (Pearlman, Schmidt, & Hunter, 1980, p. 373). This meant that the validity of personnel selection methods could not be assumed for practical use even though past studies had indicated that they were valid. It also meant that general principles about personnel selection could not be developed because "the inability to generalize validities makes it impossible to develop the general principles and theories that are necessary to take the field (personnel psychology) beyond a mere technology to the status of a science" (p. 374).
The same can be said of occupational performance. It can be argued that even when occupational performance measurement instruments have been proven valid, their validity only pertains to the research setting. A therapist cannot assume that such instruments are valid in a variety of clinical situations. The assumption that validity cannot be generalized would mean that general principles about occupational performance measurement cannot be developed. In order for therapists to have complete confidence in the validity of their occupational performance instruments, and in order to embark on the process of developing general principles about occupational performance measurement, the situation-specificity hypothesis of occupational performance measurement has to be falsified; hence the need for VG research.
In general, the VG method used in this study consisted of the following steps: (1) computation of an estimate of the observed variance of occupational performance score validity coefficients for the studies included in the analysis; (2) computation of an estimate of the variance of occupational performance score validity coefficients attributable to statistical artifacts such as sampling error, differences in criterion reliability, differences in test reliability, criterion contamination, computational and typographical errors, range restriction, and so on (in the present study, due to limited information provided in the studies that constituted our sample, such as the lack of independent variable reliability coefficients and range restriction data, we could not do a complete meta-analysis; instead, we used Hunter and Schmidt's (2004) "bare-bones" (p. 134) procedures to correct observed variability for variance due to differences among studies in sampling error and variability in test criterion reliability); (3) subtraction of the variance due to statistical artifacts from the observed variance; (4) division of the variance due to artifacts by the observed variance; (5) use of the 75% decision rule suggested by Schmidt and Hunter (1977) and Hunter and Schmidt (2004) to accept or reject the situation-specificity hypothesis based on how much of the observed variance could be accounted for by statistical artifacts; and (6) determination of the robustness of the validity of occupational performance measurement by computing pre-attenuated (corrected) mean weighted validity coefficients and their standard deviations.
We used the following formulas provided in the procedures described by Hunter and Schmidt:

Vobs = ∑[Ni(ri − rm)²]/∑Ni     (1)

where Vobs = the observed variance of occupational performance score validity coefficients; Ni = the sample size associated with the ith validity coefficient; ri = the ith validity coefficient; rm = the weighted mean of the validity coefficients; and ∑Ni = the overall sample size associated with the validity coefficients across all studies included in the analysis.

The uncorrected weighted mean validity coefficients were calculated using the following equation:

rm = ∑[Ni·ri]/∑Ni     (2)

where rm = uncorrected weighted mean validity coefficient; Ni = the sample size associated with the ith validity coefficient; ri = the ith validity coefficient; and ∑Ni = the overall sample size associated with the validity coefficients across studies included in the analysis.

The attenuating factor for criterion reliability measurement error for each coefficient was calculated using the equation:

a = √ri     (3)

where a = attenuating factor and ri = the ith reported validity coefficient.

The mean attenuation factor for criterion reliability measurement error was calculated using the equation:

am = ∑[√ri]/n     (4)

where am = mean attenuating factor; ri = the ith reported validity coefficient; and n = the number of validity coefficients in the analysis.

The pre-attenuated (corrected) mean weighted validity coefficients were calculated using the equation:

P = rm/am     (5)

where P = corrected weighted mean validity coefficient (an estimate of the true population validity coefficient); rm = uncorrected mean weighted validity coefficient; and am = mean attenuating factor.

The variability of the test criterion reliability attenuation factors was computed using the following equation:

SDa² = ∑(a − am)²/(n − 1)     (6)

where SDa² = variance of the attenuating factors.

The squared coefficient of variation of the test criterion reliability was computed using the equation:

V = SDa²/am²     (7)

The variance of reported validity coefficients due to variability in test criterion reliability was computed using the equation:

S² = P²·am²·V     (8)

The estimated variance due to sampling error was calculated using the following equation:

Vs = (1 − rm²)²/(Nm − 1)     (9)

where Vs = estimated variance due to sampling error; rm = uncorrected weighted mean validity coefficient; and Nm = the mean sample size for all 19 studies included in our analysis.

Residual variability was calculated by subtracting the combined variance due to sampling error and variability in test criterion measurement reliability from the observed variance thus:

VRes = Vobs − Vs − S²     (10)

where VRes = true or residual variance; Vobs = observed variance; Vs = variance due to sampling error; and S² = variance due to study differences in test criterion measurement reliability. The percentage of variance accounted for by a combination of sampling error and variability in test criterion measurement reliability was obtained using the equation:

(Vs + S²)/Vobs     (11)

Finally, the pre-attenuated (corrected) residual variance was computed using the equation:

V(P) = [Vobs − Vs − S²]/am²     (12)

where V(P) = corrected residual variance. The corrected standard deviation (SD(P)) was the square root of the corrected residual variance (V(P)).
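The bare-bones computation described by equations (1) through (12) can be sketched in Python as follows. This is an illustrative sketch, not the authors' code; the function name and the example coefficients and sample sizes at the end are hypothetical.

```python
import math

def bare_bones_vg(rs, ns):
    """Bare-bones validity generalization following equations (1)-(12).

    rs: reported Pearson validity coefficients (assumed non-negative here,
        since equation (3) takes their square roots)
    ns: the matching study sample sizes
    """
    k = len(rs)
    total_n = sum(ns)
    # (2) uncorrected weighted mean validity coefficient
    rm = sum(n * r for n, r in zip(ns, rs)) / total_n
    # (1) observed variance of the validity coefficients
    v_obs = sum(n * (r - rm) ** 2 for n, r in zip(ns, rs)) / total_n
    # (3)-(4) attenuation factors and their mean
    a = [math.sqrt(r) for r in rs]
    a_m = sum(a) / k
    # (5) pre-attenuated (corrected) mean weighted validity coefficient
    p = rm / a_m
    # (6)-(7) variance of the attenuation factors and their squared
    # coefficient of variation
    sd_a2 = sum((x - a_m) ** 2 for x in a) / (k - 1)
    v = sd_a2 / a_m ** 2
    # (8) variance due to variability in test criterion reliability
    s2 = p ** 2 * a_m ** 2 * v
    # (9) variance due to sampling error, using the mean sample size
    n_mean = total_n / k
    v_s = (1 - rm ** 2) ** 2 / (n_mean - 1)
    # (10) residual variance and (11) percent of variance accounted for
    v_res = v_obs - v_s - s2
    pct_accounted = 100 * (v_s + s2) / v_obs
    # (12) corrected residual variance, its SD, and the 95% credibility
    # interval P +/- 1.96*SD(P)
    v_p = (v_obs - v_s - s2) / a_m ** 2
    sd_p = math.sqrt(v_p) if v_p > 0 else 0.0
    ci = (p - 1.96 * sd_p, p + 1.96 * sd_p)
    return {"rm": rm, "v_obs": v_obs, "P": p, "v_s": v_s, "s2": s2,
            "v_res": v_res, "pct_accounted": pct_accounted,
            "v_p": v_p, "sd_p": sd_p, "ci": ci}

# Hypothetical example: four validity coefficients with their sample sizes
result = bare_bones_vg([0.1, 0.2, 0.3, 0.4], [50, 100, 150, 200])
```

Note that the residual variance (10) can be negative when the artifacts over-explain the observed variance, which is why the sketch guards the square root in step (12).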
The above description of the procedures and the results of our study were emailed to Dr. Schmidt, the leading author of the VG method. His feedback indicated that our findings were generally correct [Schmidt, personal communication].
The occupational performance areas considered in this study were:

- Self-maintenance (basic ADLs such as dressing, bathing, toileting, and obtaining nutrition; and instrumental activities of daily living such as community mobility, shopping for clothes and other items, and shopping for groceries)
- Productivity (unpaid work such as home management and care for family members; paid work; volunteering, etc.)
- Leisure (both quiet and active recreation)
- Education
- Play (sports and child play)
- Social participation (including community participation, family-related occupations, and peer-related occupations) (American Occupational Therapy Association [AOTA], 2002; Baum & Christiansen, 2005; Law et al., 2002)
The above listed criteria were derived from the new occupational therapy paradigm, which emphasizes an occupation-based, client-centered, and collaborative approach to intervention as a central component of authentic occupational therapy practice (Ikiugu, 2007; Kielhofner, 2004; Law et al., 2002).
Occupational performance measurement scores reported in the studies included self-reported ratings of performance, observed performance (either by a therapist or by other people), interview-based assessment of performance, or a combination of any or all of the above (see the discussion of data coding below). Also, Schmidt et al. (1993) demonstrated that non-Pearson validity coefficients tended to overestimate artifactual variance and therefore to underestimate true variance. They recommended exclusion of such coefficients from VG analysis. Therefore, Spearman rho, Cronbach's alpha, Intraclass Correlation Coefficients (ICC), etc. were not included in our sample. The studies retrieved for our analysis investigated convergent, divergent, criterion-referenced, and predictive validities of occupational performance.
Study Sample

We searched a variety of electronic databases for relevant studies published between 1977 and 2007 (a 30-year span). Later, we updated the search to include studies published in 2008. Key phrases used in the search included "occupational and performance"; "occupational, performance, and measurement"; and "validity, occupational performance, and measurement instruments". The databases searched included the EBSCO MegaFILE; Ovid (Cochrane Database of Systematic Reviews, All EBM Reviews); Cochrane, DSR, ACP, DARE & CCTR; Health and Psychosocial Instruments; Ovid MEDLINE In-Process and other non-indexed citations and Ovid MEDLINE 1950 to present; CINAHL; PsycINFO; and OT Search. The history and outcome of the search are reported in Table 1.
Table 1
Obtaining the Data Sample: Search History and Outcome

Key Phrase                                    | OT Search | EBSCO MegaFILE | Ovid | PsycINFO | Total
Occupational Performance                      | 940       | 134            | 1285 | 266      | 2625
Occupational Performance Measurement          | 28        | 2              | 21   | 2        | 53
Validity of Occupational Performance Measures | 25        | 0              | 2    | 0        | 27

As can be seen in Table 1, the key words "occupational and performance" resulted in 2,625 hits. When the scope was narrowed using the key words "occupational, performance, and measurement", the number of hits was reduced to 53. The terms "validity, occupational performance, and measurement instruments" yielded 27 studies. Thus, 80 studies were found to be closely related to the purpose of our study (i.e., measurement of occupational performance). The abstract for each of the identified studies was reviewed to determine its specific relevance to our objective of determining the robustness and generalizability of the validity of occupational performance measurement scores. In all, 75 studies were retrieved and downloaded, but only 25 of them were found to be relevant to our investigation. The 75 studies were downloaded online, copied from the bound journal collection in the University of South Dakota (USD) library, or acquired through interlibrary loan.
When studies in which the validity estimate coefficients were non-Pearson were removed, only 14 studies remained. A later search for studies published between 2006 and 2008 revealed 5 more studies in which the validity of occupational performance measurement was investigated using Pearson-type validity estimates. Therefore, a total of 19 studies were included in our analysis. The studies are indicated in the reference list by asterisks (*). The instruments that were the subject of the studies were: Occupational Performance Calculation Guide (OPCG); All About Me; Canadian Occupational Performance Measure (COPM); Functional Behavior Profile (FBP); World Health Organization Disability Assessment Schedule (WHODAS); Late Life Function and Disability Instrument (LLFDI); Barthel Index; Vineland Adaptive Behavior Scales (VABS); Assessment of Occupational Functioning (AOF); Functional Independence Measure (FIM); Assessment of Motor and Process Skills (AMPS); Occupational Abilities and Performance Scale (OAPS); Executive Function Performance Test (EFPT); 3-Day Physical Activity Recall Questionnaire (3dPAR); and the Activity Card Sort – Hong Kong version (ACS-HK).
Data Coding

Occupational performance scores were coded into the following four categories:

- Observation-based occupational performance: data based on measurement of occupational performance by observation, using instruments such as the Assessment of Motor and Process Skills (AMPS) and the Functional Independence Measure (FIM);
- Interview-based occupational performance: data based on measurement of occupational performance using interview-based instruments such as the Canadian Occupational Performance Measure (COPM);
- Self-report-based occupational performance: data based on clients' ratings of their perceived occupational performance using instruments such as the Role Checklist;
- Combined measures: data based on measurement of occupational performance using instruments that combine observation, interview, and self-rating.
We read the articles retrieved from our search as explained above. We created a matrix in which we outlined the data as follows: study author; instruments that were the subject of the study; type of occupational performance scores gathered using the instrument (observation-based, interview-based, or self-report); type of occupational performance variable measured (e.g., productivity, leisure, play, etc.); type of validation study (convergent, divergent, criterion-referenced, or predictive); sample size (n); and validity estimate coefficient (r). We entered the validity data (study sample sizes and validity coefficients), as laid out in our data matrix, into a Microsoft Office 2007 Excel spreadsheet. Each of the researchers checked the entries at least twice in order to ensure accuracy.
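The coding matrix described above can be represented programmatically as follows. This is a sketch only; the two example rows are hypothetical illustrations, not data from the 19 studies in the sample.

```python
# Each row of the coding matrix becomes one record; the entries below
# are hypothetical and do not reproduce data from the analyzed studies.
data_matrix = [
    {"author": "Example et al. (2005)", "instrument": "COPM",
     "score_type": "interview-based", "variable": "productivity",
     "validation": "convergent", "n": 60, "r": 0.12},
    {"author": "Sample & Case (2007)", "instrument": "AMPS",
     "score_type": "observation-based", "variable": "self-maintenance",
     "validation": "predictive", "n": 45, "r": 0.08},
]

# The validity data fed into the VG formulas are the (n, r) pairs.
ns = [row["n"] for row in data_matrix]
rs = [row["r"] for row in data_matrix]
```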
Data Analysis
Findings
Table 2
Validity Generalization Meta-Analysis Results for Various Types of Occupational Performance Measurement Scores (n=19)

Type of OP Measure      | Total N | Mean N | No. of rs | rm   | am   | P    | Observed Variance | Vs+S² | % of Variance Acct. | Corrected Residual Variance (V(P)) | SD(P) | 95% CI
All Studies Combined    | 40837   | 91.98  | 444       | .065 | .676 | .096 | .259              | .011  | 4                   | .544                               | .738  | −1.40 ≤ P ≤ 1.49
Interview-Based (n=9)   | 36724   | 122.01 | 301       | .060 | .698 | .086 | .265              | .008  | 3                   | .526                               | .725  | −1.38 ≤ P ≤ 1.46
Observation-Based (n=6) | 2455    | 45.46  | 54        | .021 | .531 | .04  | .204              | .023  | 11                  | .643                               | .802  | −1.56 ≤ P ≤ 1.59
Self-Report-Based (n=4) | 1658    | 18.63  | 89        | .228 | .637 | .358 | .193              | .055  | 29                  | .34                                | .583  | −1.01 ≤ P ≤ 1.28

Key: n = number of studies in the analysis; N = sample size; r = reported occupational performance score validity coefficient; rm = uncorrected weighted mean occupational performance score validity coefficient; am = mean attenuation factor for test criterion reliability (test criterion = occupational performance); P = pre-attenuated (corrected) mean weighted validity coefficient; Vs+S² = combined variance due to sampling error and variability in test criterion measurement reliability; % of variance acct. = percentage of occupational performance score validity coefficient variance accounted for by sampling error and test criterion reliability differences among studies; V(P) = corrected (pre-attenuated) residual variance of validity coefficients; SD(P) = corrected (pre-attenuated) standard deviation of validity coefficients; and CI = credibility interval (at 95% = P ± 1.96·SD(P)).
All studies combined

Interview-based scores

As can be seen in Table 2, the mean weighted r for interview-based occupational performance measurement was .06 (P=.086). According to Cohen and Kraemer et al., this coefficient similarly constituted low validity.
Observation-based scores

The weighted mean r for observation-based occupational performance measurement scores was .021 (P=.04). Again, this coefficient indicated low validity for this category of occupational performance measurement scores in comparison to typical instruments used in social science research.
Self-report-based scores

The mean weighted r for self-report-based occupational performance scores was .23 (P=.36). This coefficient, according to Cohen and Kraemer et al., was still low, but it was close to medium validity (r=.30) in comparison with typical instruments used in social science research. The pre-attenuated validity coefficient (P) was clearly in the medium validity range.
Generalizability of Validity Coefficients of Occupational Performance Scores

Overall generalizability

The residual overall variability of occupational performance validity coefficients, after correction for attenuation by sampling error and variability of test criterion measurement reliability, was .544, and only about 4% of the observed variance was explained by the two statistical artifacts. Based on the 75% decision rule, the situation-specificity hypothesis could not be rejected. This wide variability of validity coefficients was apparent in the credibility interval, whose pre-attenuated validity coefficient values ranged between P = −1.40 and 1.49 [variability of more than one standard deviation around the mean at the .05 confidence level (±1.96 SD)].
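As a concrete illustration, the 75% decision rule applied to the "All Studies Combined" values reported in Table 2 can be expressed as a simple check (a sketch using the reported figures; variable names are ours):

```python
# Values from the "All Studies Combined" row of Table 2
v_obs = 0.259          # observed variance of the validity coefficients
artifact_var = 0.011   # combined variance due to artifacts (Vs + S^2)

pct_accounted = 100 * artifact_var / v_obs  # about 4%
# The 75% rule: reject situation specificity only if the artifacts
# account for at least 75% of the observed variance
reject_situation_specificity = pct_accounted >= 75
```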
Interview-based scores

The residual variability of interview-based validity coefficients was .526. About 3% of the observed variance was explained by the artifacts. Again, based on the 75% decision rule, the situation-specificity hypothesis could not be rejected for interview-based occupational performance measurement score validity. The CI of interview-based occupational performance pre-attenuated validity coefficients ranged between P = −1.38 and 1.46.
Observation-based scores

The corrected residual variability of the observation-based occupational performance scores was .643. About 11% of the observed variance was attributable to sampling error and variability in test criterion measurement reliability. Therefore, the situation-specificity hypothesis could not be rejected, and as can be seen in Table 2, the credibility interval was remarkably large (P = −1.56 to 1.59).
Self-report-based scores

The corrected residual variability for self-report-based occupational performance score validity coefficients was .34 (the lowest variability among all occupational performance measurement instruments in our sample). About 29% of the observed variance was attributable to sampling error and variability in test criterion measurement reliability, again leading to failure to reject the situation-specificity hypothesis.
Discussion

Validity generalization analysis results in two mean validity coefficients: one attenuated (or modified) by statistical artifacts (in our case, sampling error and variability of test criterion reliability), and the other a pre-attenuated (corrected) coefficient. Our interpretation of those mean coefficients was based on Cohen's (1988) guidelines for determining effect sizes and their importance. In applying Cohen's notion of effect sizes, we can think of the effect of an assessment in terms of its ability to detect the indicators of occupational performance as defined in this study. In other words, it is a reference to the effectiveness of an assessment in detecting and measuring occupational performance as we have defined it. Cohen suggested that effect sizes in the social sciences were generally small in comparison to other disciplines because of attenuation of the validity of the measures used and the subtlety of the variables involved. Consequently, he defined effect sizes (d) as follows: small (d=.20; r=.10); medium (d=.50; r=.30); and large (d=.80; r=.50).
Based on the above criteria, our analysis indicated that the weighted mean occupational performance score validity coefficients for all 19 studies combined, interview-based, and observation-based occupational performance scores were small (r=.065, .060, and .021 respectively). The weighted mean validity coefficient for self-report-based scores was small but approached the medium range (r=.23). However, Cohen's interpretation criteria do not take into account the confounding effect of sample size or statistical significance. Of course, we need to remember that, as Valentine and Cooper (2003) argued, statistical significance is not the best measure of effect size because it does not provide information about "practical significance or relative impact of the effect size" (p. 1, emphasis original).
Pearson and Hartley (1962) provided guidelines for interpreting the statistical significance of r based on a chosen power level and p value. Based on Pearson and Hartley's guidelines, the pre-attenuated validity coefficients (P) at the .80 power level and p=.05 can be interpreted as follows: Overall, the mean pre-attenuated validity coefficient (P=.096), with a mean sample size of N=91.98, was not statistically significant. P would need to be at least .196 in order to be significant for that mean sample size. Using the same criteria, the pre-attenuated interview-, observation-, and self-report-based validity coefficients were similarly not statistically significant [mean N=122.01, P=.086 (critical value=.196); mean N=45.46, P=.04 (critical value=.444); and mean N=18.63, P=.36 (critical value=.632), respectively]. Therefore, our analysis indicated that the mean weighted validity coefficients of occupational performance measurement instruments for the studies in our sample were not statistically significant. They were all small in comparison with the validity of assessments used in social science research.
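The comparisons above can be tabulated as a simple check (a sketch; the coefficients and critical values are those reported in the preceding paragraph, and the dictionary layout is ours):

```python
# Pre-attenuated coefficients (P) and the critical values reported for
# each mean sample size (power = .80, alpha = .05)
coefficients = {
    "overall":           {"P": 0.096, "critical": 0.196},
    "interview-based":   {"P": 0.086, "critical": 0.196},
    "observation-based": {"P": 0.04,  "critical": 0.444},
    "self-report-based": {"P": 0.36,  "critical": 0.632},
}

# A coefficient is statistically significant only if it reaches the
# critical value for its mean sample size
significant = {name: vals["P"] >= vals["critical"]
               for name, vals in coefficients.items()}
```

Running this check confirms that none of the four categories reaches its critical value.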
Our search of the literature did not reveal other occupational performance validity generalization studies with which we could compare our findings. Ottenbacher, Hsu, Granger, and Fiedler (1996) completed a meta-analytic study in which they found rs ranging between .84 and .92. However, they did not use the VG method. Rather, they converted reliability coefficients into Fisher z scores for comparison across studies and then converted them back to reliability coefficients for interpretation. They did not weight their observed rs by sample size or correct the mean rs for attenuation by variability in test criterion reliability. Therefore, their findings were not comparable to ours.
The mean weighted rs in the present study were comparable to those found in other validity generalization studies in social science research, such as Hackett (1989), who found mean weighted coefficients ranging between .04 and .17, and Pearlman, Schmidt, and Hunter (1980), who found mean weighted rs ranging between .07 and .26. Our findings were therefore consistent with Cohen's assertion, alluded to earlier, that "many effects sought in personality, social, and clinical-psychological research (to which occupational performance may be categorized) are likely to be small…because of the attenuation in validity of the measures employed and the subtlety of the issue frequently involved" (Cohen, 1988, p. 13). Therefore, occupational therapists and scientists should not be alarmed by the small mean validity coefficients of occupational performance measurement instruments found in our analysis. Our findings indicated that the mean validity of the instruments compared well with that of other instruments used in social science research to measure phenomena that are as elusive as occupational performance.
The mean weighted validity coefficient for self-report-based occupational performance scores was the highest in our analysis (rm=.23, P=.36). This finding suggested that among the occupational performance measurement assessments used in occupational therapy, those that were based on self-report of occupational performance, such as the Assessment of Occupational Functioning (AOF) and the 3-Day Physical Activity Recall Questionnaire (3dPAR), were the most valid. Given the few studies in this category in our sample (n=4), this finding was interesting. It denoted the possibility of high mean validity coefficients for such assessments in future VG research when more studies become available. This finding could be particularly important in light of the current occupational therapy paradigm, which emphasizes client-centeredness in therapeutic discourse. It means that use of instruments that require clients to identify their own occupational performance priorities, consistent with the client-centered focus of the professional paradigm, may be the most valid approach to occupational performance assessment.
In general, more VG research is indicated as more studies become available so that more conclusive findings may be realized. In addition, the reader should note that our findings are limited because we could only include Pearson-type validity coefficients in our analysis. If the entire range of coefficients had been included, it is possible that the mean validities would have been different. One of the most important findings in our study was that none of the occupational performance score validity estimate coefficients was found to be generalizable using the 75% decision rule proposed by Schmidt and Hunter (1977) and Hunter and Schmidt (2004) after correction for variability due to statistical artifacts. Self-report-based occupational performance scores were the most generalizable, with 29% of the observed variance being attributed to sampling error and variability in test criterion measurement reliability. Interview-based occupational performance validity coefficients were the least generalizable, with the two statistical artifacts accounting for only 3% of the observed variance.
This lack of generalizability would suggest that therapists cannot automatically assume the validity of occupational performance assessments (note that this statement applies only to Pearson validity estimate coefficients), even if such instruments have been proven valid in prior research. However, it is important to remember that only variance due to sampling error and variability in test criterion measurement reliability was accounted for. The studies included in our analysis did not provide enough information to allow us to do a complete meta-analysis in which all attenuating factors would have been corrected for. Therefore, more validation studies are indicated in order to determine conclusively the generalizability of the validity of occupational performance measurement instruments in the clinical situations where therapists need to use them.
Furthermore, it may be clinically useful to bear in mind that, according to our findings, self-report-based occupational performance scores had the most robust mean validity coefficient and the highest generalizability from research to clinical settings, even in our limited meta-analysis in which many attenuating factors were not accounted for. This finding was encouraging because it indicated that the assessment instruments most consistent with the occupational therapy paradigm had the greatest promise of being found valid and transferable to clinical situations in future VG analyses involving more studies. However, our findings were not conclusive. In the end, until further analysis is completed to reach more definitive conclusions about occupational performance score validity generalization, therapists and scientists need to review the available validity research for instruments that they want to use in practice and make their decisions based on clinical expertise and the specific circumstances of practice.
It has also been suggested in the literature that the percentage of observed variance attributable to statistical artifacts may not be a good basis for decisions about VG. Rather, it is “the actual amount of observed variance” that is important (McDaniel, Hirsh, Schmidt, et al., 1986, p. 144). Even Hunter and Schmidt (2004) noted that the 75% rule has been misinterpreted: it is not a means of statistically testing for chance fluctuations due to sampling error. In situations such as our study, where the number of studies in the meta-analysis was small, the observed variance could be significantly larger than sampling error alone can account for. In that sense, the relatively small variance (.193) of self-report-based validity coefficients is of interest, because it means that the validity of such instruments could conceivably be generalizable. However, there is no meaningful way, other than the 75% rule, of determining an absolute value of observed variance that would be considered small enough to allow generalizability.
Based on our findings, it may be beneficial to replicate this study with both published and unpublished studies included, in order to determine whether the generalizability of occupational performance score validity coefficients is really a problem or whether our findings are due to the limitations of the types of studies included in the analysis. In addition, an investigation of modifiers that increase the variability of validity coefficients from situation to situation may be beneficial.
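The moderator search suggested here — grouping assessments by shared characteristics and comparing within-group variability to overall variability — can be illustrated with a short sketch. The groupings, coefficients, and sample sizes are hypothetical:

```python
# Hypothetical moderator (subgroup) analysis sketch: if assessment type
# modifies validity, subgroup mean coefficients should differ and
# within-subgroup variance should fall well below the overall variance.

def weighted_stats(rs, ns):
    """Sample-size-weighted mean and variance of validity coefficients."""
    total_n = sum(ns)
    mean_r = sum(n * r for r, n in zip(rs, ns)) / total_n
    var_r = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    return mean_r, var_r

# Hypothetical subgroups of validity coefficients (rs, ns)
self_report = ([0.42, 0.47, 0.44], [80, 120, 100])
interview = ([0.15, 0.30, 0.08], [60, 90, 70])

overall_mean, overall_var = weighted_stats(
    self_report[0] + interview[0], self_report[1] + interview[1]
)
sr_mean, sr_var = weighted_stats(*self_report)
iv_mean, iv_var = weighted_stats(*interview)

# A moderating variable is suggested when subgroup means differ markedly
# and each within-subgroup variance is much smaller than the overall one.
moderator_suggested = sr_var < overall_var and iv_var < overall_var
```

In this fabricated example the pooled variance is inflated by mixing two assessment types whose mean coefficients differ, which is exactly the pattern that separate meta-analyses of instrument subgroups would expose.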
Limitations and Recommendations
One limitation of this study may be the small number of studies included in the meta-analysis (only 19). However, further literature review indicated that this is typical of validity generalization research. In his review, Hackett (1989) found that the number of studies included in a variety of meta-analyses ranged between 20 and 31, and the number of correlations between 106 and 707. Therefore, it is typical to have a small number of studies in a meta-analysis, perhaps because of the stringent inclusion criteria necessary to ensure the commonality of characteristics that makes generalizability possible.
A more significant limitation was the fact that only published studies were included in our analysis. This may have tended to exclude methodologically poor studies, such as those with inadequate power due to small sample sizes, thereby failing to take into account the broad spectrum of validation research and skewing the findings (Schmidt & Hunter, 1977). That is why many VG investigators seek to identify both published and unpublished studies for inclusion in their meta-analyses (Pearlman, Schmidt, & Hunter, 1980).
A strength of this study was the fact that occupational performance measurement was stringently defined, consistent with the current occupational therapy paradigm. Therefore, rejection of the situation-specificity hypothesis would have meant that a therapist needed only to analyze the clinical situation in which he or she intended to use an instrument: if the circumstances of clinical practice were similar to those of the validation research, and if the assessment instrument measured occupational performance as defined in this study, he or she could use the instrument with assurance of its validity, without need for further validation.
It is recommended that this study be replicated and that an attempt be made to include unpublished studies in the replication. Also, in future studies, attempts should be made to convert non-Pearson validity coefficients into forms analyzable using the VG method, in order to make the meta-analysis more inclusive and complete. In addition, situation-specific modifiers that increase the variability of validity coefficients, and therefore make their generalization difficult, should be investigated. One way to do that may be to identify the common characteristics of occupational performance measurement instruments and to complete separate meta-analyses of groups of assessments based on those characteristics. That would reveal groups in which the variances of validity coefficients differ significantly, allowing us to draw conclusions about the attenuating factors that are most important in occupational performance measurement. Finally, the present meta-analysis should be updated regularly, using the methods proposed by Schmidt and Raju (2007), as new validation studies become available.
Conclusion
In this study, a meta-analysis was completed using Hunter and Schmidt’s (2004) methods in order to determine the robustness of the mean validity coefficients of occupational performance scores and the generalizability of those coefficients from research to clinical situations. Our analysis revealed that the mean weighted validity coefficients of self-report-based occupational performance measurement scores were the most robust, suggesting that self-report-based instruments were the most valid. After correction of the observed variance by subtracting variability due to sampling error and variability of test-criterion measurement reliability, too much residual variance remained unexplained under the 75% decision rule. Therefore, situation-specificity was not rejected, and the generalizability of the validity of occupational performance measurement scores could not be justified. It is important to bear in mind that many attenuating factors that could have explained such variance were not accounted for, because of the limited information provided in the studies included in our sample. Because of that limitation, definite conclusions could not be drawn. Future research should investigate situation modifiers that increase the variability of validity coefficients, making them ungeneralizable.
References

*Aitken, D., & Bohannon, R. W. (2001). Functional Independence Measure versus Short Form-36: Relative responsiveness and validity. International Journal of Rehabilitation, 24, 65-68.

American Occupational Therapy Association. (2002). Occupational therapy practice framework: Domain and process.

American Occupational Therapy Foundation. (2007). Resource center: Evidence-based practice. Retrieved

Baum, C. M., & Christiansen, C. H. (2005). Person-environment-occupation-performance: An occupation-based framework for practice. In C. H. Christiansen & C. M. Baum (Eds.), Occupational therapy: Performance, participation, and well-being.

*Baum, C. M., Connor, L. T., Morrison, T., Hahn, M., Dromerick, A. W., & Edwards, D. F. (2008). Reliability, validity, and clinical utility of the Executive Function Performance Test: A measure of executive function in a sample of people with stroke. American Journal of Occupational Therapy, 62, 446-455.

*Baum, C. M., Edwards, D. F., & Morrow-Howell, N. (1993). Identification and measurement of productive behaviors in senile dementia of the Alzheimer type. The Gerontologist, 33, 403-8.

Bennett, S., & Townsend, L. (2006). Evidence-based practice in occupational therapy: International initiatives. World Federation of Occupational Therapists Bulletin, 53, 6-11.
Brannick, M. T., & Hall, S. M. (2001, April). Reducing bias in the Schmidt-Hunter meta-analysis. Poster session presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology,

Callender, J. C., & Osburn, H. G. (1980). Development and test of a new model for validity generalization. Journal of Applied Psychology, 65, 543-558.

Canadian Association of Occupational Therapists. (2007). Joint position statement on evidence-based occupational therapy 1999. Retrieved

*Carpenter, L., Baker, G. A., & Tydesley, B. (2001). The use of the Canadian Occupational Performance Measure as an outcome of a pain management program. Canadian Journal of Occupational Therapy, 68(1), 16-22.

*Chan, V. W., Chung, J. C., & Packer, T. L. (2006). Validity and reliability of the activity card sort –

*Chwastiak, L. A., & von Korff, M. (2003). Disability in depression and back pain: Evaluation of the World Health Organization Disability Assessment Schedule (WHO DAS II) in a primary care setting. Journal of Clinical Epidemiology, 56, 507-14.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).

Coster, W. J., Gillette, N., Law, M., Lieberman, D., & Scheer, J. (2004). International conference on evidence-based occupational therapy. Retrieved

DePoy, E., & Gitlin, L. N. (2005). Introduction to research: Understanding and applying multiple strategies (3rd ed.).

*Dubuc, N., Haley, S. M., Ni, P., Kooyoomjian, J. T., & Jette, A. M. (2004). Function and disability in late life: Comparison of the Late-Life Function and Disability Instrument to the Short-Form-36 and the London Handicap Scale. Disability and Rehabilitation, 26(6), 362-70.
*Fricke, J., & Unsworth, C. A. (1996). Inter-rater reliability of the original and modified Barthel Index and a comparison with the Functional Independence Measure. Australian Occupational Therapy Journal, 43, 22-9.

Hackett, R. D. (1989). Work attitudes and employee absenteeism: A synthesis of the literature. Journal of Occupational Psychology, 62, 235-248.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.).

Ikiugu, M. N. (2007). Psychosocial conceptual practice models in occupational therapy:

*Ikiugu, M., & Ciaravino, E. A. (2006). Assisting adolescents experiencing emotional and behavioral difficulties (EBD) transition to adulthood. International Journal of Psychosocial Rehabilitation, 10(2), 57-78.

Ilott, L. (2004). Challenges and strategic solutions for a research emergent profession. American Journal of Occupational Therapy, 58, 347-352.

*Karidi, M. V., Papakonstantinou, K., Stefanis, N., Zografou, M., Karamouzi, G., Skaltsi, P., et al. (2005). Occupational abilities and performance scale: Reliability-validity assessment and factor analysis. Social Psychiatry and Psychiatric Epidemiology, 40, 417-424.

Kielhofner, G. (2004). Conceptual foundations of occupational therapy (3rd ed.).

Kraemer, H., Morgan, G. A., Leech, N. L., Gliner, J. A., Vaske, J. J., & Harmon, R. J. (2003). Measures of clinical significance. Journal of the American Academy of Child and Adolescent Psychiatry, 42(12), 1524-1529.

Law, M., Baum, C., & Dunn, W. (2005). Measuring occupational performance: Supporting best practice in occupational therapy.

Law, M., Polatajko, S., Baptiste, S., & Townsend, E. (2002). Core concepts of occupational therapy. In E. Townsend (Ed.), Enabling occupation: An occupational therapy perspective (pp. 29-56).
*McColl, M. A.,

McDaniel, M. A., Hirsh, G. R., Schmidt, F. L., Raju, N., & Hunter, J. E. (1986). Interpreting the results of meta-analytic research: A comment on Schmitt, Gooding, Noe, and Kirsch (1984). Personnel Psychology, 39, 141-148.

*McNulty, M. C., & Fisher, A. G. (2001). Validity of using the Assessment of Motor and Process Skills to estimate overall home safety in persons with psychiatric conditions. American Journal of Occupational Therapy, 55, 649-55.

*Missiuna, C. (1998). Development of “All About Me”, a scale that measures children’s perceived motor competence. Occupational Therapy Journal of Research: Occupation, Participation, and Health, 18(2), 85-108.

*Mori, A., & Sugimura, K. (2007). Characteristics of Assessment of Motor and Process Skills and Rivermead Behavioral Memory Test in elderly women with dementia and community-dwelling women.

Ottenbacher, K., Hsu, Y., Granger, C., & Fiedler, R. (1996). The reliability of the Functional Independence Measure: A quantitative review. Archives of Physical Medicine and Rehabilitation, 77, 1226-1232.

Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.

Pearson, E. S., & Hartley, H. O. (Eds.). (1962). Biometrika tables for statisticians (2nd ed.).

Polatajko, H. J. (2006). In search of evidence: Strategies for an evidence-based practice process. Occupational Therapy Journal of Research, 26, 23.
*Ripat, J., Etcheverry, E., Cooper, J., & Tate, R. (2001). A comparison of the Canadian Occupational Performance Measure and the Health Assessment Questionnaire. Canadian Journal of Occupational Therapy, 68(4), 247-53.

*Rochman, D. L., Ray, S. A., Kulich, R. J., Mehta, N. R., & Driscoll, S. (2008). Validity and utility of the Canadian Occupational Performance Measure as an outcome measure in a craniofacial pain center. Occupational Therapy Journal of Research: Occupation, Participation and Health, 28(1), 4-11.

*Rosenbaum, P., Saigal, S., Szatmari, P., & Hoult, L. (1995). Vineland Adaptive Behavior Scales as a summary of function outcome of extremely low birth weight children. Developmental Medicine and Childhood Neurology, 37, 577-586.

Schmidt, F. L., & Raju, N. S. (2007). Updating meta-analytic research findings: Bayesian approaches versus the medical model. Journal of Applied Psychology, 92, 297-308.

Schmidt, F. L., Law, K., Hunter, J. E., Rothstein, H. R., Pearlman, K., & McDaniel, M. (1993). Refinements in validity generalization methods: Implications for the situational specificity hypothesis. Journal of Applied Psychology, 78(1), 3-12.

Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529-540.

*Stanley, R., Boshoff, K., & Dollman, J. (2007). The concurrent validity of the 3-day physical activity recall questionnaire administered to female adolescents aged 12-14 years. Australian Occupational Therapy Journal, 54, 294-302.

Trombly, C. A., & Ma, H. (2002). A synthesis of the effects of occupational therapy for persons with stroke, Part I: Restoration of roles, tasks, and activities. American Journal of Occupational Therapy, 56, 250-259.

Valentine, J. C., & Cooper, H. (2003). Effect size substantive interpretation guidelines: Issues in the interpretation of effect sizes.

*Studies constituting the sample that we analyzed.