Let’s change tack a little bit here and discuss a specific set of sensory tests: overall difference tests.
Deciding whether two samples of beer are different is not as easy as it may seem. Everyone's senses work slightly differently, so what one person finds to be a noticeable difference may be detected by few, if any, others. Various types of bias may also lead people to find differences that don't exist. On top of this, the need for accuracy in your data means that you often need more than just a few people before you can truly say whether there is a difference. So, as with all laboratory procedures, there are standardized methods and tests used when searching for differences in food and beverage systems. Even under the heading of "overall difference tests" there are a number of different tests that can be used, each with their own pros and cons.
Overall difference tests are used to find whether there is any detectable difference between two samples. Where exactly that difference originates is not necessarily part of the goal of the difference test, although you can usually pull out some hints to help guide your progress. This type of test differs from the more specific “attribute difference test” which seeks to determine whether a difference exists on the basis of one specific aspect of the sample, whether it is the color, the bitterness, the phenolic aroma, etc. I’ll discuss attribute difference tests later, but before we move on to the tests themselves, a word about error first.
In statistical tests such as these, there are essentially two types of error: α-error and β-error. α-error is a numerical representation of the risk you are willing to accept of finding a false positive, or finding a difference when one doesn't exist. β-error is the same type of numerical representation, but it signifies the risk you accept of finding a false negative, or missing a real difference that exists between the samples. In practical situations, you must balance which risk you want to minimize over the other, since minimizing both requires many more panelists and samples than most production environments can accommodate. For overall difference tests it is usually the α-risk that is minimized, while the β-risk is allowed to be large to keep the number of assessors reasonable. The default value for α is usually 0.05, meaning you, as the administrator, accept a 1/20 chance that the results will indicate a difference when one doesn't actually exist. An α of 0.05 isn't required by any means, but it usually offers a good balance between risk management and panel size.
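To make the α/β trade-off concrete, here's a minimal sketch in Python of how both risks can be computed for a forced-choice test where a pure guess is correct 1/3 of the time (as in the triangle test below). The panel size of 24 and the 30% "proportion of discriminators" are assumptions chosen purely for illustration.

```python
from math import comb

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct responses."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, alpha = 24, 0.05   # 24 assessments, 5% false-positive (alpha) risk
p_guess = 1 / 3       # chance that a pure guess is correct

# Smallest number of correct answers rare enough under pure guessing:
critical = next(k for k in range(n + 1) if tail(n, k, p_guess) <= alpha)

# Assumed alternative: 30% of panelists truly detect the difference,
# and the rest guess. Probability that any one response is correct:
p_d = 0.30
p_correct = p_d + (1 - p_d) * p_guess

# Beta-risk: chance the panel falls short of the critical count
# even though a real difference exists.
beta = 1 - tail(n, critical, p_correct)
print(critical, round(beta, 2))
```

Even with a sizable assumed effect, a 24-assessment panel carries a large chance of missing the difference, which is exactly why β is often allowed to stay high in production settings.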
What follows is a breakdown of a few of the more commonly used types of tests that can be used to find an overall difference between test samples.
The Triangle Test

Description: Probably the most commonly used difference test across the industry, the Triangle Test is essentially a 3-sample "which one of these is not like the others" test. The three samples are presented to a panelist in an identical, simultaneous, and blinded fashion. The panelist smells and/or tastes the three samples (or otherwise follows directions), determines which of the three is the odd man out, and marks that answer on the ballot. Each panelist taking the test should see their three samples in randomized order, and the occurrence of each sample type (test vs. control) as the "odd sample" should be balanced to mitigate bias. In other words, for every panelist who has one test sample and two controls, there should be another panelist who has the opposite. While the spirit of the triangle test rules says you should use a different panelist for each assessment (20-40 panelists), practicality sometimes necessitates the "re-use" of panelists for multiple assessments. This is acceptable in a pinch, but care must be taken to keep the potential for bias to a minimum, and continued randomization is essential. As few as 12 panelists (or assessments) can be used when the differences are expected to be large enough.
Uses: The triangle test is useful when the potential differences in the product are of indeterminate complexity, such as when changes in production techniques or ingredients may alter the profile of the product in an unpredictable or multi-faceted way. For changes where the potential differences may be limited to only one aspect of the product's profile (the color, the hop aroma, or the viscosity, for example), the more targeted approach of an attribute difference test may be more applicable. The test can also be used to monitor the sensitivity of panelists and track their performance.
Pros: Statistically one of the most “powerful” of the difference tests, since guessing will only yield the correct answer 33% of the time. Also a relatively low number of samples need be tasted, so carry-over and fatigue can be minimal for certain products.
Cons: Can be difficult to use with samples that are prone to causing sensory fatigue, carry-over, or adaptation.
Below is a partial re-creation of the results table for the triangle test. You can use this chart to interpret your own difference tests if you'd like. "n" signifies the number of assessments in the test, and 0.1/0.05 represents the α-risk. To use the chart, match the α-risk you want with the number of assessments you have performed to find the minimum number of correct responses needed to show significance. For example, for 20 assessments at an α of 0.05, you'd need 11 or more correct responses to show a significant difference.

n      α=0.1   α=0.05
12     7       8
20     10      11
24     12      13
30     14      15
You can see that there isn't much of a difference in the minimum number of correct answers needed to show significance between the alphas of 0.1 and 0.05, at least until you get to high n values. More noteworthy, however, is the number of correct answers needed relative to the number of assessments (n). At low n values, the number of correct answers is quite high (over 70% of the n-value), but as you increase the number of assessments the statistical "law of large numbers" kicks in and the minimum number of correct responses drops as a proportion of n. At n=120, it's down to about 40-42%. Since the odds of a correct guess in a triangle test are 1/3, these values will always stay above 33%.
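If you'd rather compute these minimums than look them up, the exact binomial math behind the table is short enough to script. Here's a sketch in Python (the particular n values are just for illustration):

```python
from math import comb

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct guesses."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_correct(n, p_guess, alpha=0.05):
    """Fewest correct responses out of n needed to show significance."""
    return next(k for k in range(n + 1) if tail(n, k, p_guess) <= alpha)

# Triangle test (guess probability 1/3) at alpha = 0.1 and 0.05:
for n in (12, 20, 24, 30):
    print(n, min_correct(n, 1/3, 0.1), min_correct(n, 1/3, 0.05))
```

The same function works for any forced-choice test; only the guess probability changes.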
The Duo-Trio Test

Description: Panelists are presented with an identified "control" sample, followed by two coded (blinded) samples, one of which matches the control. Panelists are instructed to identify which sample matches the control. Two sub-types of this test exist: the constant reference mode, where the same sample is the control during all tests, and the balanced reference mode, where the role of control is randomly balanced between the two samples being compared. A minimum of 16 or so panelists is recommended, while discrimination is greatly improved with panels of 32 or more.
Uses: Same uses as for Triangle Test.
Pros: A simpler, less confusing test for panelists than the Triangle Test.
Cons: Less suitable if samples have a pronounced and lingering aftertaste. Also less statistically robust than the Triangle Test, since a guess has a 50% chance of being correct. This means more correct responses are needed to meet the same significance level with a panel of equivalent size (for example, at α=0.05 a panel of 30 needs 15 correct for the Triangle but 20 for the Duo-Trio).
The Two-out-of-Five Test

Description: Panelists are presented with 5 coded samples and are instructed that two of the samples belong to one group and the other three to another. Panelists then indicate which samples belong to which group. When possible, samples are presented simultaneously. A balanced presentation is highly recommended, particularly when dealing with low numbers of panelists (i.e., the number of presentations with three "control" samples should be balanced with the number of presentations with three "test" samples – balance, balance, balance).
Uses: Same uses as previous tests: process changes, ingredient changes, panelist monitoring.
Pros: Very statistically robust, since the chance of correctly guessing which 2 of the 5 samples form the pair is 1/10, rather than 1/3 as in the Triangle Test. Fewer panelists are often needed due to this boost in power – as few as 5 or 6 can be used when the expected differences are large enough.
Cons: Not a good choice when carry-over and lingering aftertastes are an issue, due to the high number of samples that must be tasted.
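Since all three of these tests reduce to counting correct responses against a guessing probability, it's easy to see how the guessing odds drive the amount of evidence required. Here's a quick comparative sketch in Python (the panel size of 30 is just an illustrative choice):

```python
from math import comb

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_correct(n, p_guess, alpha=0.05):
    """Fewest correct responses out of n needed to show significance."""
    return next(k for k in range(n + 1) if tail(n, k, p_guess) <= alpha)

n = 30
for name, p_guess in [("duo-trio", 1/2),
                      ("triangle", 1/3),
                      ("two-out-of-five", 1 / comb(5, 2))]:  # C(5,2) = 10 pairings
    print(name, min_correct(n, p_guess))
```

At 1/10 guessing odds, well under a third of the responses need to be correct, which is why such small two-out-of-five panels can still be conclusive.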
The Difference-from-Control Test

Description: Panelists are presented with a known control sample and one or more coded test samples. Each panelist rates the size of the difference between each test sample and the control. Any intensity scale can be used (1-5, 1-10, etc.) so long as the panel is familiar with it.
Uses: Same as previous tests, but this test is also used when you want to know the magnitude of the difference and when the size of the potential difference will affect the decisions to be made about the product. Also a good choice for difference testing when the samples are heterogeneous by nature: baked goods, meats, etc.
Pros: Good test when carry-over and aftertastes are a problem, since only two samples are needed for the basic test.
Cons: Low statistical strength due to the higher chance of a correct guess, therefore more panelists are needed to overcome this. Although untrained panelists can be used, panelists with some training are often needed for this test since some experience with intensity ratings is helpful. A mixture of untrained and trained panelists is NOT recommended for a single test.
Interpretation of the data for these various tests can be done either by looking up the appropriate tables (as in the Triangle Test table above) or by analysis of variance or paired t-tests on the data itself. So that I'm not scanning pages from my own texts, here's a link to a table I found on the web. I can't find any 2/5 Test information at the moment, but since it's the least applicable of these tests for beer (due to carry-over effects) I don't think it's a great loss.
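For Difference-from-Control data specifically, a paired t-test is simple enough to run by hand. Here's a minimal sketch in Python with entirely hypothetical ratings: each of 10 panelists rates a test sample against the control, and also rates a blind control against the control (the blind-control scores capture the baseline, placebo-level difference):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical difference ratings (0-10 scale) from 10 panelists:
test_vs_control  = [4, 5, 3, 6, 4, 5, 4, 6, 5, 4]
blind_vs_control = [2, 3, 2, 3, 1, 2, 3, 2, 2, 3]

# Paired t-test: work with each panelist's rating difference.
diffs = [t - b for t, b in zip(test_vs_control, blind_vs_control)]
t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Two-tailed critical t for 9 degrees of freedom at alpha = 0.05 is 2.262,
# so a |t| beyond that indicates a significant difference.
print(round(t_stat, 2), abs(t_stat) > 2.262)  # prints: 6.87 True
```

The blind-control ratings matter: panelists almost never score a hidden control as "no difference," so the test sample's ratings have to beat that baseline rather than zero.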
One semi-final note: one of the things I'm always telling my panelists is that they shouldn't get too focused on getting the answer "right," since this is not a test of their aptitude or sensitivity. Rather, we are testing the beer and whether there is a difference between two samples, so theoretically an "incorrect" answer is just as valuable as a "correct" one.
And one final-final note, perhaps even a chance to blow your mind: just because a test's results indicate that there is no statistically significant difference, it does NOT mean that the samples are similar. In fact, to show similarity between samples, different – but (ha!) similar – tests are needed, usually with much higher n-values as well. Not sure if I'll get around to discussing similarity testing for a while, so we'll just leave it there for now.