6  Effect Size: Explanation and Guidelines

Effect size is a simple idea that is finally gaining traction. It refers to a class of statistics that quantify the magnitude of a relationship or difference, independent of sample size. Most effect size statistics are standardized, so a given effect size statistic can be compared directly with that same type of effect size statistic from other analyses—or even from other studies that sample the same or similar populations.

The effect being measured can be either a difference (such as the difference between an experimental-group and a control-group mean, or the difference in number of events between groups) or an association (e.g., correlations). Different effect size statistics are computed in different ways; this means that we cannot usually directly compare one effect size statistic to another type of effect size statistic. However, the same type of effect size can be compared across different analyses or studies, and in many cases, effect size measures can be converted from one form to another (see Section 6.3).

Effect sizes are descriptive statistics. Measures of the size of an association (like a correlation) may assume a linear relationship1, but effect size statistics don’t assume, e.g., that the population is normally distributed. Since they make few assumptions, effect size statistics are inherently robust.

Effect size statistics can complement significance tests. Significance is, of course, a yes-or-no indication of whether there is “enough” of a difference/association relative to noise: An effect is either significant or not; there are no gradations to significance. Effect size statistics do show gradations and so can be used to properly provide the nuance that people seek when they report that something is “very” or “slightly”—or even “almost”—significant. (As noted in Section 6.2 below, effect size statistics are often described as being “small,” “medium,” or “large,” but this valuation of them doesn’t—well, shouldn’t—carry anything but an arbitrary weight.)

Effect sizes can also be reported with confidence intervals, providing an informal test of significance. Since an effect size measures magnitude, while a significance test determines whether an effect is “not zero,” an effect is likely significant if its 95% confidence interval does not include zero. However, statistical significance still depends on factors such as model specification and the inclusion of covariates.

6.1 Common Effect Size Statistics

6.1.1 Mean Differences

These measure the distance between two or more means. Like most effect size statistics, they are also standardized (measured in terms of standard deviations) so they can be compared between studies.

Cohen’s d

One of the most commonly used effect size statistics is Cohen’s d, which expresses the standardized difference between two group means:

\[\text{Cohen's }d = \frac{\text{First Mean}-\text{Second Mean}}{\text{Pooled }SD}.\]

We combine (or “pool”) the SDs because there are two of them (one SD for each mean). To do this, we essentially take the average of the two SDs2.

Therefore, Cohen’s d is presented in terms of standard deviations. A Cohen’s d of 1 means that the means are one standard deviation apart.

You may remember that z-scores are also presented in terms of standard deviations—that a z-score of 1 means that that person’s score is one standard deviation away from the mean. This isn’t a coincidence and means that Cohen’s d can be looked at as a z-score.
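
For illustration, here is a minimal sketch in Python of this computation; the function name and the example data are hypothetical, and the pooled SD is the df-weighted version given in the formula for \(s_p\) later in this section.

```python
import numpy as np

def cohens_d(x1, x2):
    """Standardized mean difference between two samples, using the pooled SD."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    # df-weighted pooled variance (the s_p formula given later in this section)
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt(pooled_var)

# Made-up treatment and control scores
treatment = [5.1, 6.3, 5.8, 7.0, 6.1]
control = [4.2, 5.0, 4.8, 5.5, 4.9]
print(round(cohens_d(treatment, control), 2))  # about 1.99: the means are roughly two SDs apart
```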

Cohen’s f and f2

Cohen introduced f as a measure of effect size for F-tests, specifically to quantify differences among three or more means. In contrast, he developed d to measure the effect size between two means. The exact formula for computing f varies slightly depending on the number of levels in the factor and the variance structure.

To extend this concept to more complex models, Cohen introduced \(f^2\), which applies not only to ANOVA-family models but also to general(ized) linear regression. The primary distinction between f and \(f^2\) is that \(f^2\) is simply f squared. Cohen recommended using \(f^2\) for complex models because it aligns with how other parameters, such as variance-explained measures, are typically computed using squared values.

An important advantage of \(f^2\) is its flexibility: it can be used to assess the effect of a single predictor or a set of predictors, whether or not other variables in the model have been controlled for or partialed out.

More about Cohen’s f can be found at this Statistics How to page.

Other Measures of Mean Differences

Cohen’s d is not the only measure of the effect size of mean differences—although it is the most common. Two others—Hedges’ g and Glass’s Δ—are worth mentioning. All three are standardized effect size measures used to quantify the difference between two groups in terms of standard deviations, but they differ slightly in calculation and applicability; a short computational sketch of all three follows the details below.

Table 6.1: Common Effect Size Measures of Mean Differences

| Aspect | Cohen’s d | Hedges’ g | Glass’s Δ |
|---|---|---|---|
| Denominator | Pooled standard deviation | Pooled standard deviation with small-sample correction | Control group standard deviation |
| Use Case | Large samples, equal variances | Small samples | Unequal variances |
| Correction Factor | None | Corrects for small sample bias | None |
| Applicability | Widely used in social sciences | More accurate for small samples | Best for heteroscedastic data |

Summary

  • Use Cohen’s d in large-sample studies with equal variances.
  • Use Hedges’ g to correct for bias in small samples.
  • Use Glass’s \(\Delta\) when group variances are expected to differ substantially.
1. Cohen’s d
  • Cohen’s d measures the standardized mean difference between two groups.
  • Formula: \[d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}\] where:
    • \(\bar{X}_1\) and \(\bar{X}_2\) are the means of the two groups.
    • \(s_p\) is the pooled standard deviation: \[s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\]
  • Key Points:
    • Assumes equal variances between the groups (homoscedasticity).
    • Suitable for large samples.
    • Can overestimate the effect size for small sample sizes.
2. Hedges’ g (Correction for Small or Unequal Samples)
  • Hedges’ g is a variation of Cohen’s d that corrects for the upward bias in d when sample sizes are small (usually considered when n < 20).
  • Formula: \[g = d \times \left(1 - \frac{3}{4(n_1 + n_2 - 2) - 1}\right)\]
  • Key Points:
    • Incorporates a correction factor to reduce bias in small sample sizes.
    • Provides a more accurate effect size estimate when \(n < 20\).
    • For large samples, Hedges’ g converges to Cohen’s d.
    • Often used in meta-analysis where comparisons between studies of very different sizes are made.
3. Glass’s Δ
  • Glass’s Δ uses only the standard deviation of the control group (\(s_2\)) as the denominator, instead of a pooled standard deviation.
  • Formula: \[\Delta = \frac{\bar{X}_1 - \bar{X}_2}{s_2}\] where:
    • \(s_{2}\) is the standard deviation of the control group.
  • Key Points:
    • Useful when variances between groups are unequal (heteroscedasticity).
    • May produce biased estimates if the control group standard deviation is not representative.
    • Often applied in scenarios where the experimental treatment group might naturally have a higher variance (e.g., due to a treatment effect).
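
For illustration, here is a minimal Python sketch of Hedges’ g and Glass’s Δ using the formulas above; the function names and data layout are hypothetical.

```python
import numpy as np

def hedges_g(x1, x2):
    """Cohen's d computed with the pooled SD, then bias-corrected for small samples."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    sp = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
    d = (x1.mean() - x2.mean()) / sp
    return d * (1 - 3 / (4 * (n1 + n2 - 2) - 1))  # small-sample correction factor

def glass_delta(treatment, control):
    """Standardized mean difference using only the control group's SD."""
    treatment, control = np.asarray(treatment, dtype=float), np.asarray(control, dtype=float)
    return (treatment.mean() - control.mean()) / control.std(ddof=1)
```

With large samples the correction factor approaches 1, which is why Hedges’ g converges to Cohen’s d.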

6.1.2 Proportions of Variance Explained

Cohen’s d and f measure the (standardized) difference between means. Cohen’s d measures it for two means, while Cohen’s f is used to measure it among three or more means. Both of these statistics can range from zero (when there is no difference) to positive infinity. Both simply represent the number of standard deviations between the means, and if the means are more than one standard deviation apart, then the effect size will be greater than 1.

Another set of effect size measures are standardized differently: They measure proportions, and so can only range between 0 and 1. The ones described in this section measure the proportion of total variance explained by a particular term in a regression model.

(Squared) Correlations

Perhaps the simplest measure of the proportion of variance explained is the correlation, specifically the squared correlation. Squared correlations are indeed effect size statistics: They measure the proportion of the variance in each of two variables that is explained by the variables’ relationship with each other.

For example, if the correlation between two variables is .50, i.e., if \(r = .50\), then \(r^2 = .50^2 = .25\). In that case, the correlation accounts for 25% of the variance in each of the variables.

Eta-squared (η2) and Partial η2

The other three “proportion of variance explained” statistics are used to measure the effect size of individual terms in a linear regression model.

The first of these is eta-squared (\(\eta^2\)), which quantifies the proportion of total variance in the outcome variable that is explained by a given predictor. It is calculated as:

\[ \eta^2 = \frac{SS_{\text{Effect}}}{SS_{\text{Total}}} \]

This makes \(\eta^2\) conceptually similar to \(R^2\), which measures the total proportion of variance explained by all predictors in a regression model. Like the correlation coefficient \(r\), eta (\(\eta\)) itself can be understood as the proportion of standard deviation differences in the outcome explained by the predictor, while \(\eta^2\) represents variance explained as a proportion of total variance.

However, \(\eta^2\) has a notable limitation: it does not account for other predictors in the model. As additional terms are introduced, the individual \(\eta^2\) values for each predictor tend to decrease, since they represent only the variance uniquely attributable to each predictor relative to total variance.

To address this, researchers use partial eta-squared (\(\eta_p^2\)), which represents the proportion of variance explained by a specific predictor after accounting for other predictors in the model. Partial \(\eta^2\) is conceptually similar to partial \(r^2\), as it isolates the unique contribution of a predictor while removing variance shared with other terms.

In a one-way ANOVA (i.e., a model with a single categorical predictor), \(\eta^2\) is equivalent to the overall model \(R^2\). However, in models with more than one predictor, partial \(\eta^2\) is preferred and the overall \(R^2\) will be different than each of the partial \(\eta^2\)s.
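
To make the distinction concrete, here is a minimal Python sketch that computes both statistics from the sums of squares of a hypothetical two-predictor ANOVA table; the SS values are made up.

```python
def eta_squared(ss_effect, ss_total):
    # Proportion of the *total* variance attributable to the effect
    return ss_effect / ss_total

def partial_eta_squared(ss_effect, ss_error):
    # Proportion of variance attributable to the effect, excluding variance
    # already explained by the other terms in the model
    return ss_effect / (ss_effect + ss_error)

# Hypothetical model with two predictors: SS_A = 20, SS_B = 30, SS_error = 50
ss_total = 20 + 30 + 50
print(eta_squared(20, ss_total))       # 0.20
print(partial_eta_squared(20, 50))     # about 0.29
```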

\(\eta^2\) Compared to Cohen’s \(f\) and \(f^2\)

Cohen’s \(f\) and \(f^2\) serve a similar purpose but differ in how they handle variance:

  • \(\eta^2\) vs. \(f\) (ANOVA): While \(\eta^2\) measures the proportion of variance explained by a factor, \(f\) adjusts for unexplained variance, making it more suitable for cross-study comparisons. The relationship between them is:

\[f = \sqrt{\frac{\eta^2}{1 - \eta^2}}\]

  • Partial \(\eta^2\) vs. \(f^2\) (Regression): Partial \(\eta^2\) describes the proportion of variance explained by a predictor after controlling for other variables, while Cohen’s \(f^2\) expresses variance explained relative to unexplained variance. For a whole model,

\[f^2 = \frac{R^2}{1 - R^2},\]

and the incremental contribution of a predictor (or set of predictors) \(B\) added to a model that already contains \(A\) is

\[f^2 = \frac{R^2_{AB} - R^2_{A}}{1 - R^2_{AB}}.\]

Since \(f^2\) explicitly models the variance explained relative to unexplained variance, it is commonly used in multiple regression, particularly for power analysis and comparing models across studies.

Thus, while \(\eta^2\) and partial \(\eta^2\) are useful for describing within-sample variance explained, \(f\) and \(f^2\) provide standardized effect size measures better suited for meta-analysis and statistical power estimation.
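
As a small worked example (with a made-up value), a predictor with \(\eta^2 = .06\) corresponds to

\[f = \sqrt{\frac{.06}{1 - .06}} \approx .25 \qquad \text{and} \qquad f^2 = \frac{.06}{1 - .06} \approx .064,\]

which lines up with the “medium” values for \(\eta^2\) and \(f\) listed in Table 6.3 below.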

Table 6.2: When to Use \(\eta^2\), \(f\), and \(f^2\)

| Criterion | \(\eta^2\) | \(f\) (ANOVA) | \(f^2\) (Regression) |
|---|---|---|---|
| Use Case | ANOVA (variance explained) | ANOVA (standardized effect size) | Regression (incremental variance explained) |
| Interpretation | Proportion of total variance explained | Standardized measure of effect size | Standardized measure of predictor impact |
| Best for Comparing Studies? | No | Yes | Yes |
| Used in Power Analysis? | No | Yes | Yes |
| Inflation in Small Samples? | Yes | No | No |

Therefore:

  • Use \(\eta^2\) to describe the proportion of variance explained in ANOVA and regression models.
  • Use Cohen’s \(f\) for standardizing effect sizes in ANOVA, making them comparable across studies.
  • Use Cohen’s \(f^2\) in regression to assess the impact of specific predictors, particularly when measuring incremental effects.
  • For a single dichotomous predictor, Cohen’s d and \(\eta^2\) can be converted into each other, but for more complex models, additional transformations are required.

This Analysis Factor post gives a good further explanation of η2. Recommendations on interpreting and reporting η2 are given well in this StackExchange Q&A.

Omega-squared (ω2)

ω2 is very similar to η2. They both measure the proportion of total variance accounted for by a given term in a model, but compute it in slightly different ways3. The way η2 computes it makes it systematically overestimate the size of an effect when it is used to estimate the size of the effect in the population (i.e., when inferring from the sample to the population). Although this overestimation gets smaller as the sample gets larger, it is always present (until the sample is the same size as the population).

The way ω2—and partial ω2—estimate unexplained variance makes them always smaller than η2 (and partial η2). ω2 is therefore a more conservative estimate of effect size than η2. Given this, many prefer ω2 over η2.

Epsilon-squared (ε2)

The third and final member of our Greek-alphabet soup of stats to measure the proportion of variance explained is ε2. Everyone agrees that η2 overestimates the effect. Some, like Okada (2013), argue that ω2 is sometimes too conservative, underestimating the true size of an effect.

ε2 (and partial ε2) may be closer to “just right,” giving what may be the least biased estimate. In any case, its value always falls between the other two (or equals one of them).

It’s worth noting that in a one-way ANOVA, ε2 is equal to the adjusted R2.
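
For illustration, here is a minimal Python sketch computing all three statistics from a one-way ANOVA table, using the formulas given in footnote 3; the sums of squares and degrees of freedom are made up.

```python
def variance_explained(ss_between, ss_within, df_between, df_within):
    """eta-squared, omega-squared, and epsilon-squared from a one-way ANOVA table."""
    ss_total = ss_between + ss_within
    ms_within = ss_within / df_within
    eta2 = ss_between / ss_total
    omega2 = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    epsilon2 = (ss_between - df_between * ms_within) / ss_total
    return eta2, omega2, epsilon2

# Hypothetical: 3 groups of 10 (df_between = 2, df_within = 27)
print(variance_explained(40, 160, df_between=2, df_within=27))
# eta2 = .200, omega2 = .137, epsilon2 = .141 -- epsilon2 falls between the other two
```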

6.1.3 Odds & Risk Ratios

Odds ratios (ORs) and risk ratios (RRs) are often treated as standardized measures of effect size. Under appropriate conditions—i.e., comparable outcome definitions and baseline rates—they can be used to compare the magnitude of associations across studies.

Risk is simply another term for probability, and risk ratios represent the relative likelihood of an event between two groups. Risks themselves range from 0 to 1, much like proportion-of-variance metrics such as \(\eta^2\) or \(R^2\) (Section 6.1.2).

In contrast, odds and odds ratios are unbounded above and can exceed 1. This asymmetry may make them less intuitive for some audiences, especially when comparing across studies. Nonetheless, it is statistically valid to compare odds or odds ratios across studies—though in some contexts, interpretability may be improved by transforming them to effect size statistics bounded between 0 and 1.

Two classic measures that do just this are the φ (phi) coefficient and Yule’s Q. Both are designed to quantify the strength of association between two binary variables—for example, the relationship between disease status (present/absent) and group membership (exposed/unexposed). When variables have more than two categories, related measures such as Cramér’s V are more appropriate.

The φ coefficient is defined as:

\[ \phi = \frac{AD - BC}{\sqrt{(A + B)(A + C)(D + B)(D + C)}} \]

where \(A\), \(B\), \(C\), and \(D\) refer to the cell counts of a 2 \(\times\) 2 contingency table:

\[ \begin{array}{|c|c|c|} \hline & \text{Present} & \text{Not Present} \\ \hline \text{Group 1} & A & B \\ \hline \text{Group 2} & C & D \\ \hline \end{array} \]

Despite its structural differences from the Pearson correlation coefficient, φ is mathematically equivalent to \(r\) when both variables are dichotomous. It is also frequently used as an effect size accompanying \(\chi^2\) tests, and can be computed directly as \(\phi = \sqrt{\chi^2 / n}\).

While φ is a valid and interpretable measure of association, it has notable limitations. It is sensitive to rare outcomes and can be inflated when marginal frequencies are highly unbalanced. This makes φ less suitable for studies involving rare events—such as mortality rates—where other statistics may provide more stable estimates.

Yule’s Q was developed to address these limitations. It is specifically designed to measure association in terms of odds and is effectively a transformation of the odds ratio onto a scale ranging from −1 to +1, similar to correlations. Given a 2 \(\times\) 2 contingency table, it is defined as:

\[ Q = \frac{AD - BC}{AD + BC} \]

Alternatively, it can be expressed directly in terms of the odds ratio:

\[ Q = \frac{\text{OR} - 1}{\text{OR} + 1} \]

This transformation offers a symmetric, bounded, and more interpretable summary of the magnitude of association when using odds ratios.
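
For illustration, here is a minimal Python sketch computing φ and Yule’s Q from the cell counts of a 2 \(\times\) 2 table laid out as above; the counts are made up.

```python
import math

def phi_coefficient(a, b, c, d):
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (d + b) * (d + c))

def yules_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

# Hypothetical table: Group 1 = 30 present / 20 not; Group 2 = 15 present / 35 not
a, b, c, d = 30, 20, 15, 35
odds_ratio = (a * d) / (b * c)                        # 3.5
print(round(phi_coefficient(a, b, c, d), 2))          # 0.30
print(round(yules_q(a, b, c, d), 2))                  # 0.56
print(round((odds_ratio - 1) / (odds_ratio + 1), 2))  # 0.56, matching Q
```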

6.2 “Small,” “Medium,” & “Large” Effects

Like much of statistics, Cohen’s d is standardized into z-scores/SDs (remember, its formula divides the mean difference by the pooled SD). However, simply reporting Cohen’s d without interpreting what it means has a couple of disadvantages: (a) z-scores are not intuitive for lay audiences, and (b) there are other measures of effect size than Cohen’s d—and they aren’t all measured on the same scale. Given both of these factors, in his seminal book, Statistical Power Analysis for the Behavioral Sciences, Jacob Cohen (1988) gave recommendations for how to interpret the magnitude of various effect size statistics in terms of “small,” “medium,” and “large” effects.

These “criteria” for evaluating the magnitude of an effect size have become quite popular. Indeed, the adoption of effect size statistics seems to be regulated by people’s uses and understandings of them in relation to these criteria. They therefore deserve further consideration.

6.2.1 Effect Size Criteria as Percent of Total Variance

Cohen generally defined effect sizes based on the percent of the total variance that the effect accounted for4:

  • “small” effects account for 1%,
  • “medium” effects account for 10%, and
  • “large” effects account for 25%.

I say that he generally defined them as such because he didn’t see a need to be bound to this definition, in part because he repeatedly noted—as do I here—that these criteria were arbitrary. He defined them based on percent of total variance for d and then chose “small,” “medium,” and “large” values for other effect size statistics that corresponded to those values for d.

This meant, for example, that he chose levels for correlations that don’t always match up to what one would expect by squaring the correlations to get the percents of total variance. In other words, his criteria for correlations weren’t that a “small” correlation would be r = .1 (i.e., where r2 = .01), a “medium” one r = .5, and a “large” one r \(\approx\) .63. In justifying this, he notes that he is not positing these criterion levels based on strict mathematical equivalences but instead on a concerted attempt to equate the sorts of effects one would obtain with one analytic strategy to those obtained with another; for example, the types of effect sizes experimental psychologists obtain with t-tests with those they would obtain through correlations.

6.2.2 Effect Size Criteria as Noticeability of Effects

Although Cohen was thorough in his descriptions of these effect size criteria in terms of proportions of total variance, he was also careful to couch them in practical and experimental terms.

A “small” effect is the sort he suggested one would expect to find in the early stages of a line of research, when researchers have not yet determined the best ways to manipulate/intervene and when much of the noise has not yet been controlled.

A “small” effect can also be considered to be a subtle but non-negligible effect: the sorts of effects that are often found to be significant in field-based studies with typical samples and manipulations/interventions. Examples Cohen gives include:

  • The mean difference in IQs between twin and non-twin siblings5,
  • The difference in visual IQs of adult men and women, &
  • The difference in heights between 15- and 16-YO girls.

A “medium” effect is one large enough to see with the naked eye. Examples Cohen gives include:

  • The mean difference in IQs between members of professional and managerial occupations,
  • The mean difference in IQs between “clerical” and “semiskilled” workers, &
  • The difference in heights between 14- and 18-YO girls.

A “large” effect is one that is near the upper limit of effects attained in experimental psychological studies. So yes, the generalization of this criterion to other areas of science—including nursing research—is certainly not directly supported by Cohen himself.

Examples include:

  • The mean difference in IQs between college freshmen and those who’ve earned Ph.D.s6,
  • The mean difference in IQs between those who graduate college and those who have a 50% chance of graduating high school,
  • The difference in heights between 13- and 18-YO girls, &
  • The typical correlation between high school GPAs and scores on standardized exams like the ACT.

6.2.3 Effect Size Criteria for Odds Ratios

Cohen (1988) discussed proportions (aka risks) and presented effect size measures for a proportion’s difference from .5 (Cohen’s g) and for the difference between two proportions (Cohen’s h), which could be used to convey the magnitude of a risk ratio. Even though a risk ratio per se is already a fine effect size statistic, Cohen didn’t give size criteria for risk ratios, but instead for his h.

He didn’t, however, discuss odds or odds ratios directly, and thus didn’t give his opinion about what could be considered “small,” “medium,” and “large” values for odds or odds ratios. And although Yule’s Q (Section 6.1.3) can be considered comparable to risk ratios, risk ratios, as just noted, weren’t given size criteria either.

Chen et al. (2010) nonetheless give some guidance by providing ranges of effect size criteria for odds ratios, comparing their values against the criteria for “small,” “medium,” and “large” Cohen’s ds. Chen et al.’s (2010) rules of thumb for “small,” “medium,” and “large” odds ratios (below) deserve some explanation.

The size of an odds ratio depends not just on the difference in outcomes within a group (e.g., the numbers of Black women with and without pre-eclampsia), but also on the difference in outcomes in a comparison group (e.g., the numbers of non-Black women with and without pre-eclampsia). It is thus not so easy to devise simple (simplistic) rules of thumb for the sizes of odds ratios7.

In addition, the exact values for what to consider a “small,” “medium,” or “large” effect depend on the overall frequency of the event, with rarer events requiring larger odds ratios to equate to a given level of Cohen’s d.

Nonetheless, Chen et al. (2010) present some rules of thumb that can serve as guides in most cases. Using the median values suggested by their results:

  • “Small” \(\approx\) 1.5
  • “Medium” \(\approx\) 2.75
  • “Large” \(\approx\) 5

However, those suggestions can vary substantially based on the event rate in the reference group (infection rates in the non-exposed group in Chen et al.’s article):

Some Suggested Odds Ratios Corresponding to “Small,” “Medium,” and “Large” Effect Sizes Based on the Probability of the Event in the Reference Group (from Chen et al., 2010, p. 862)

| Probability of Event in Reference Group | “Small” OR | “Medium” OR | “Large” OR |
|---|---|---|---|
| .01 | 1.68 | 3.47 | 6.71 |
| .05 | 1.53 | 2.74 | 4.72 |
| .10 | 1.46 | 2.50 | 4.14 |

These estimates are based on simulations assuming a logistic model and are meant as heuristics, not rigid standards. Importantly, they illustrate that the magnitude of an odds ratio is not directly comparable across studies unless the base rates are similar.

6.2.4 A Few Words of Caution About Effect Size Criteria

As useful as it is to talk about effect sizes being “small” or “large,” I must underline Cohen’s own admonition (e.g., p. 42) that we use these rules of thumb about “small,” “medium,” and “large” effects cautiously8. He notes, for example, that

when we consider r = .50 a large [effect size], the implication that .25 of the variance is accounted for is a large proportion [of the total variance] must be understood relatively, not absolutely.

The question, “relative to what?” is not answerable concretely. The frame of reference is the writer’s [i.e., Cohen’s own] subjective average of [proportions of variance] from his reading of the research literature in behavioral science. (pp. 78 – 79)

Many people—including reviewers of manuscripts and grant proposals—take them to be nearly as canonical as p < .05 for something being “significant.” This is a real shame since effect sizes offer us the opportunity to finally move beyond making important decisions based on simplistic, one-size-fits-all rules.

Therefore, effect size measures, including Cohen’s d, are best used objectively to compare effects between studies—not to establish some standardized gauge of the absolute value of an intervention. This is indeed part of what is done in meta-analyses.

It is also what I suggest doing within your own realm of research: Just like Cohen himself did, review what appears to be generally agreed on as “small,” “medium,” and “large” effects within your research realm. These could, for example, correspond to levels of clinical significance9. Unfortunately, though, Cohen’s suggestions for his realm of research have become themselves canonized as the criteria for most lines of research in the health and social sciences.

Indeed, interventions and factors that have “small” effects can be quite important. This seems especially true for long-term changes, such as those one strives for in educational interventions or for the subtle but persistent effects of racism. Teaching a diabetic patient how to check their blood glucose may have only a small effect on their A1C levels on a given day, but it can save their life (or at least a few toes) in the long run.

Given this, Kraft (2020) used a review of educational research to suggest different criteria for gauging what should be considered as “small,” “medium,” or “large” effects in education interventions. His recommendations are also presented below.

6.2.5 Table of Effect Size Statistics

Table 6.3: Effect Size Interpretations

| Statistic | Explanation | Small | Medium | Large | Reference |
|---|---|---|---|---|---|
| d | Difference between two means | 0.2 | 0.5 | 0.8 | Cohen (1988, p. 25) |
| d | For education interventions | 0.05 | \(<\) .2 | \(\ge\) .2 | Kraft (2020) |
| g | Hedges’ modification of Cohen’s d for small samples | 0.2 | 0.5 | 0.8 | Hedges (1981) |
| h | Difference between proportions | 0.2 | 0.5 | 0.8 | Cohen (1988, p. 184) |
| w (also called φ) | χ2 goodness of fit & contingency tables. φ is also a measure of correlation in 2 \(\times\) 2 contingency tables, and ranges between 0 and 1. | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 227) |
| Cramér’s V | Similar to φ, Cramér’s V is used to measure the differences in larger contingency tables. Like φ (and other correlations), it ranges between 0 and 1. | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 223) |
| r | Correlation coefficient (difference from r = 0) | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 83) |
| q | Difference between correlations | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 115) |
| η2 | Parameter in a linear regression & AN(C)OVA | 0.01 | 0.06 | \(\ge\) .14 | |
| f | AN(C)OVA model effect; equivalent to \(\sqrt{f^2}\) | 0.1 | 0.25 | 0.4 | Cohen (1988, p. 285) |
| f | For education interventions (i.e., the f equivalents of the Cohen’s ds suggested by Kraft) | 0.025 | \(<\) .1 | \(\ge\) .1 | Kraft (2020) |
| f2 | A translation of R2 | 0.02 | 0.15 | 0.35 | For multiple regression / multiple correlation, Cohen (1988, p. 413); for multivariate linear regression (multivariate R2), Cohen (1988, p. 477) |
| OR | Odds ratio; can be used as effect size for Fisher’s exact test and contingency tables in general. | 1.5 (or 0.67) | 2.75 (or 0.36) | 5 (or 0.20) | Chen et al. (2010, p. 862) |

6.3 Converting Between Effect Size Measures

Most effect size statistics can be converted into other ones, but the process isn’t always possible or direct (or it requires additional assumptions). Table 6.5 presents the effect size statistics covered here that can be converted (and the conditions/assumptions required to do so); Table 6.6 presents the effect size statistics that can’t be meaningfully converted.

More usefully, Table 6.4 presents the formulas for converting between those effect size statistics for which this can be done readily and meaningfully.

Perhaps even more usefully, this handy Excel spreadsheet can convert between Cohen’s d, r, η2, odds ratios, and area under the curve.

In Chapter 7 of their book on meta-analysis, Borenstein et al. (2011) also cover well the conversions between measures. Finally, the effectsize package for R can both compute and convert between many effect size measures, including all those mentioned here.
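
To make a few of these conversions concrete, here is a minimal Python sketch of some of the formulas collected in Table 6.4 below; the function names are hypothetical, and the effectsize package mentioned above offers full-featured equivalents.

```python
import math

def d_to_r(d):
    return d / math.sqrt(d**2 + 4)

def r_to_d(r):
    return 2 * r / math.sqrt(1 - r**2)

def d_to_eta2(d):
    return d**2 / (d**2 + 4)

def d_to_f2(d):
    return d**2 / 4

def or_to_d(odds_ratio):
    # logistic approximation: d = ln(OR) * sqrt(3) / pi
    return math.log(odds_ratio) * math.sqrt(3) / math.pi

print(round(d_to_r(0.8), 2))     # 0.37
print(round(d_to_eta2(0.8), 2))  # 0.14
print(round(or_to_d(2.75), 2))   # 0.56
```

Note that a “large” d of 0.8 maps onto \(\eta^2 \approx .14\) (the “large” \(\eta^2\) in Table 6.3) but onto \(r \approx .37\) rather than .5, illustrating the point from Section 6.2.1 that Cohen’s criteria for different statistics are not strict mathematical equivalents.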

Table 6.4: Formulas to Convert Between Common Effect Size Statistics

| From ↓ / To → | Cohen’s \(d\) | Hedges’ \(g\)10 | Pearson’s \(r\) | \(\eta^2\) | \(f\) | \(f^2\) | \(\phi\), \(V\) (2×2 only) | OR (logistic approx.) |
|---|---|---|---|---|---|---|---|---|
| \(d\) | | \(g = d \cdot \left(1 - \frac{3}{4N - 9}\right)\) | \(r = \frac{d}{\sqrt{d^2 + 4}}\) | \(\eta^2 = \frac{d^2}{d^2 + 4}\) | \(f = \frac{d}{2}\) | \(f^2 = \frac{d^2}{4}\) | \(\phi = \frac{d}{\sqrt{d^2 + 4}}\) | \(\text{OR} = e^{\pi d / \sqrt{3}}\) |
| \(g\) | \(d = \frac{g}{1 - \frac{3}{4N - 9}}\) | | as \(d\) | as \(d\) | as \(d\) | as \(d\) | as \(d\) | as \(d\) |
| \(r\) | \(d = \frac{2r}{\sqrt{1 - r^2}}\) | as \(d\) | | \(\eta^2 = r^2\) | \(f = \frac{r}{\sqrt{1 - r^2}}\) | \(f^2 = \frac{r^2}{1 - r^2}\) | \(\phi = r\) | |
| \(\eta^2\) | \(d = \sqrt{\frac{4 \eta^2}{1 - \eta^2}}\) | as \(d\) | \(r = \sqrt{\eta^2}\) | | \(f = \sqrt{\frac{\eta^2}{1 - \eta^2}}\) | \(f^2 = \frac{\eta^2}{1 - \eta^2}\) | | |
| \(f\) | \(d = 2f\) | as \(d\) | \(r = \frac{f}{\sqrt{f^2 + 1}}\) | \(\eta^2 = \frac{f^2}{1 + f^2}\) | | \((f)^2 = f^2\) | | |
| \(f^2\) | \(d = 2\sqrt{f^2}\) | as \(d\) | \(r = \sqrt{\frac{f^2}{1 + f^2}}\) | \(\eta^2 = \frac{f^2}{1 + f^2}\) | \(f = \sqrt{f^2}\) | | | |
| \(\phi\) or \(V\) | \(d = \frac{2\phi}{\sqrt{1 - \phi^2}}\) | as \(d\) | \(r = \phi\) | \(\eta^2 = \phi^2\) | | | | |
| OR | \(d = \frac{\ln(\text{OR}) \cdot \sqrt{3}}{\pi}\) | as \(d\) | | | | | | |
Table 6.5: Common Effect Size Statistics That Can Be Converted into Each Other

| This Effect Size Statistic… | Can Be Converted To… | Under These Conditions |
|---|---|---|
| Cohen’s d | g, r, η2, f, f2, OR, φ | Assumes continuous, normally distributed data; OR/φ require dichotomous approximation |
| Hedges’ g | d | Hedges’ g is a modification of Cohen’s d for small sample sizes |
| Pearson’s r | d, f, f2, η2 | Assumes linear relationship |
| η2 | r, f, f2, d | Limited to ANOVA models |
| Cohen’s f | d, r, η2, f2 | In ANOVA models |
| Cohen’s f2 | R2, f, r, η2 | In multiple regression contexts |
| Cohen’s w | φ or V | In 2 \(\times\) 2 tables |
| Cramér’s V | φ, w | Only for 2 \(\times\) 2; not convertible to d, f, etc. |
| φ | r, w, V, d (with assumptions) | In 2 \(\times\) 2 tables; n.b., this is an approximate d conversion |
| Odds Ratio (OR) | d (approx.), log-OR | Approximate only; assumes logistic distribution |
| Risk Ratio (RR) | d (approx.) | Approximate only; assumes log-binomial model |
Table 6.6: Common Effect Size Statistics That Cannot Be Converted into Each Other

| Pair | Why Not Convertible |
|---|---|
| h \(\leftrightarrow\) d, f, r | h is based on arc-sine transformed proportions (i.e., a different metric) |
| q \(\leftrightarrow\) d, f, r | q compares correlations (via Fisher’s z) |
| V \(\leftrightarrow\) f | V is for categorical data (chi-square); f is for continuous data |
| OR \(\leftrightarrow\) r, f (directly) | Only approximate; depends on baseline prevalence |
| RR \(\leftrightarrow\) anything else (except OR) | RR has no meaningful transformation outside risk models |

6.3.1 A Few Notes on Conversions

In addition to simply listing the formulas for possible conversions, there are a few more points to make—and a couple more conversions that are worth knowing. Below are further considerations about converting Cohen’s f (and f2) to Cohen’s d and about converting relevant effect size stats into the t-scores and F-scores used to test mean differences.

6.3.2 Cohen’s f (and f2) to Cohen’s d

Cohen’s f2 (and f) measures the effect size of an entire model (usually an ANOVA). Cohen’s d measures the effect size between two levels of a single variable11. So, in order to convert between f2 and d, we have to know more about the model. For a one-way ANOVA with two groups12, d = 2f = 2\(\sqrt{f^2}\). In this particular case, then, f = \(\frac{d}{2}\).

More generally, when there is only one term in the model (a single factor with \(k\) levels):

\[f^2 = \frac{d^2}{2k}\]

It gets a bit more complicated when there is more than one term in the model. This site covers some common situations.

6.3.3 Cohen’s d and Student’s t

This is the t in t-test. For a one-sample (or paired) t-test, the only additional piece of information we need to transform between Cohen’s d and Student’s t is the sample size, N:

\[t = d \times \sqrt{N}\]

\[\text{Cohen's }d = \frac{t}{\sqrt{N}}\]
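
As a quick sketch (with made-up values), the Python below implements this conversion; the second function gives the analogous conversion for two equal-sized independent groups, where \(d = 2t/\sqrt{N}\).

```python
import math

def d_from_t_one_sample(t, n):
    # d = t / sqrt(N), per the formula above (one-sample or paired t-test)
    return t / math.sqrt(n)

def d_from_t_two_groups(t, n_total):
    # For two equal-sized independent groups: d = t * sqrt(1/n1 + 1/n2) = 2t / sqrt(N)
    return 2 * t / math.sqrt(n_total)

print(round(d_from_t_one_sample(2.5, 25), 2))   # 0.5
print(round(d_from_t_two_groups(2.5, 100), 2))  # 0.5
```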

6.3.4 η2 and F-scores

This is the F-test score used in ANOVA-family models. Like the relationship between d and t, the only additional things we need to know to compute η2 from F are the degrees of freedom (which are closely related to sample size). Here, though, we have degrees of freedom in both the numerator (top) and the denominator (bottom13):

\[\eta^2 = \frac{F \times df_{Effect}}{F \times df_{Effect} + df_{Error}}\]

So, η2 depends on the ratio of the dfs allotted to the given effect and the dfs allotted to its corresponding error term. Since we have the effect’s dfs in both the numerator and denominator, their effect will generally cancel out; this suggests that having more levels to a variable doesn’t appreciably affect the size of its effect. However, being able to allot more dfs to error does help us see the size of whatever effect is there. Larger samples won’t really change the size of the effects we’re measuring, but they can help us see ones that are there.
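
To make the arithmetic concrete, here is a minimal Python sketch of recovering η2 from a reported F-score and its degrees of freedom; the values are made up.

```python
def eta2_from_f(f_stat, df_effect, df_error):
    # eta-squared = (F * df_effect) / (F * df_effect + df_error)
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)

# e.g., a reported F(2, 57) = 4.5
print(round(eta2_from_f(4.5, df_effect=2, df_error=57), 3))  # about 0.136
```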

6.4 Additional Resources

Cohen’s duck


  1. In this case, it also would assume homoskedasticity. They also assume that samples are independently and identically distributed (“iid”), meaning that (a) the value of each data point in a given variable is independent from the value of all/any other data point for that variable and (b) each of those data points in that variable are drawn from the same distribution, e.g., they’re all drawn from a normal distribution.↩︎

  2. For what it’s worth, we actually take the square root of the average of the two variances, i.e.: \(\text{Pooled }SD = \sqrt{\frac{SD^2_{\text{First Mean}}+SD^2_{\text{Second Mean}}}{2}}\) (assuming equal group sizes; with unequal groups, each variance is weighted by its degrees of freedom, as in the formula for \(s_p\) in Section 6.1.1).↩︎

  3. If you’re curious about how the three measures—η2; ω2; and the next one, ε2—are computed (from Maxwell, Camp, & Arvey, 1981, cited in Okada, 2013):\[\eta^2 = \frac{SS_{b}}{SS_{t}}\] \[\omega^2 = \frac{SS_{b} - df_{b}MS_{w}}{SS_{t} + MS_{w}}\] and \[\epsilon^2 = \frac{SS_{b} - df_{b}MS_{w}}{SS_{t}}\] where SSb is the sum of squares between groups, dfb is the degrees of freedom between groups, SSw is the sum of squares within groups, MSw is the mean square within groups (i.e., SSw divided by its degrees of freedom), and SSt is the total sum of squares (i.e., SSt = SSb + SSw).↩︎

  4. These percents of variance accounted for are for zero-order correlations (i.e., correlations between two variables). The percent accounted for considered “small,” “medium,” and “large” for model R^2s are slightly higher (2%, 13%, and 26%, respectively).↩︎

  5. The source for this—Husén, T. (1959). Psychological twin research: A methodological study. Stockholm: Almqvist & Wiksell—was too old for me to see if he means mono- or dizygotic twins. But I tried!↩︎

  6. So, I guess a full higher education career does have a large effect on a person. And, yeah, Cohen does seem a little pre-occupied with IQ, doesn’t he?↩︎

  7. This is also true for, e.g., risk ratios, hazard ratios, means ratios, and hierarchical models.↩︎

  8. Cohen also only directly considered these criteria as they applied to experimental psychology—not, e.g., the health sciences. Indeed, he elsewhere notes that what experimental psychologists would call a “large” effect would be paltry in the physical sciences.↩︎

  9. With, say, the target level of outcome denoting a “medium” effect. Reaching \(\frac{1}{3}\) of that target could denote a “small” effect, and reaching \(\frac{2}{3}\) more than it (167% of the target) a “large” one. (This corresponds to the range between many of Cohen’s criteria; for example, the criteria for r are .1, .3, and .5.)↩︎

  10. Hedges’ \(g\) can be converted to any other effect size that Cohen’s \(d\) can be converted to. To convert to Hedges’ \(g\) instead of \(d\), multiply \(d\) in the given equation by \(\left(1 - \frac{3}{4N - 9}\right)\).↩︎

  11. Remember, Cohen’s d is just the difference between two means that is then standardized.↩︎

  12. Which is itself really just a t-test but using an ANOVA framework instead.↩︎

  13. My mnemonic to remember which is which is to think of the saying, “The lowest common denominator.”↩︎