Musings on Test Length
-
- Moderator
- Posts: 4315
- Joined: Sun Jan 26, 2014 12:48 pm
- Division: Grad
- State: GA
- Has thanked: 218 times
- Been thanked: 75 times
Musings on Test Length
Random thoughts after writing a test for UT Invitational. Often when we write tests, we compensate for larger tournaments by increasing test length, the principle being that more questions leave more room for differentiation. However, as tournaments this year grow much larger than we are used to, it's worth considering the limits of this principle. Clearly, increasing test length indefinitely won't work: at some point even the best teams can't complete longer and longer tests, so the value of added length for stratifying rankings diminishes. I don't have a solution to this problem or I would just state it, but it seems important to talk about this year.
- These users thanked the author Unome for the post (total 2):
- Giantpants (Wed Oct 28, 2020 6:49 pm) • Adi1008 (Wed Oct 28, 2020 7:42 pm)
-
- Member
- Posts: 189
- Joined: Thu Feb 07, 2019 5:42 am
- Division: Grad
- State: NY
- Pronouns: He/Him/His
- Has thanked: 150 times
- Been thanked: 159 times
Re: Musings on Test Length
I was actually just thinking the same thing the other day when writing the event supervisor post for BEARSO Geologic Mapping, since there were of course so many teams lol. Thank you Unome for raising the point. It was also a huge concern of mine when writing the test, since I'm still kind of a new test writer, and the format was still brand new then too.
On my feedback form, a vast majority of submissions told me that the test was difficult because of its length, but I still received comments that some deemed the test not long enough... Obviously nothing can please everyone, but I'd say it's good to err on the side of making it longer, because if a team is good enough that they can deem a "long test" short, then they probably know what they're talking about, which is good of course!
And of course, I guess it just highlights the importance of making a good difficulty spread, since throwing a lot of hard questions out there will not only alienate teams that aren't super high level, but it will just make a jumble of low scores. I guess that's the best way to make sure a test can never really be too long, properly accounting for a wide range of skills and proficiency levels amongst participating teams. It's easy to look at the "best teams" going to a competition and feel the need to make the test harder for them, but this ultimately hurts the score distribution, especially at a competition with 120 or 200 teams like UT or BEARSO.
Did that make sense? Maybe, idk. Anyone else have any other input?
- These users thanked the author Giantpants for the post:
- Mr.Epithelium (Fri Oct 30, 2020 11:36 pm)
Haverford College, Class of 2024!
Former President, Kellenberg, 2018-2020
Bro. Joseph Fox, 2014-2017
Events I'm Writing in 2023: Sounds of Music, Rocks and Minerals
Events I've Written in Years Past: Geologic Mapping, Remote Sensing
Giantpants's Userpage
-
- Exalted Member
- Posts: 306
- Joined: Thu Nov 28, 2019 3:42 pm
- Division: C
- State: CA
- Has thanked: 156 times
- Been thanked: 289 times
Re: Musings on Test Length
Although I do prefer tests I can't finish within the time frame, I appreciate that test-writers have lives. I don't mind tests that top-level teams can finish within the 50 minutes if there are at least some difficult or critical-thinking questions. It's usually less the test length itself that irks me than the fact that, beyond a certain point, the test is less a gauge of overall preparedness and more a gauge of how carefully you read questions, how closely your answer was formatted to the answer key (even without considering autograders this is sometimes an issue), and how lucky you were when combing for that one random bit of trivia.
I think one way that might help with letting less experienced teams still have a good testing experience would be putting a section of easier vocab or recall questions that at least give them something if they didn't delve into the event super deeply. This way teams are less likely to be discouraged immediately by the difficulty and forget to comb through the test for easy points. If it wasn't a big portion of the score, I don't think it would cause issues with cheating.
Please, please don't artificially make tests more "difficult" by using confusing wording, small/blurry pictures, or obscure trivia that is barely relevant to the event. Students can tell.
I really like questions that rely on facts learned from preparation to arrive at a new conclusion using critical thinking, but I know those types of questions are hard to write and harder to grade, especially with longer tests. Questions that test for understanding instead of recall tend to be more cheating-proof and more fun.
- These users thanked the author SilverBreeze for the post (total 2):
- RiverWalker88 (Sun Nov 01, 2020 9:26 pm) • CPScienceDude (Thu Dec 24, 2020 8:26 am)
Troy SciOly 2019 - 2023
Captain 2021-2023
Former Events: Ecology, Water Quality, Green Gen, Ornithology, Forestry, Disease Detectives, Forensics, Chem Lab, Env Chem, Sounds, Dynamic Planet, Crime Busters, Potions & Poisons, Exp Design, Towers, Mystery Arch, Reach for the Stars, Mission Possible
-
- Member
- Posts: 20
- Joined: Wed Aug 02, 2017 4:27 pm
- Division: Grad
- State: VA
- Pronouns: He/Him/His
- Has thanked: 0
- Been thanked: 6 times
Re: Musings on Test Length
Personally, I've never changed the length of my exams based on how big the tournament is.
My goal is always to write an exam that covers a wide range of topics at varying levels of depth with an appropriately high skill cap. I believe that, if an exam does those things, then it will accurately measure a team's knowledge, preparation, and test-taking abilities. Increasing the difficulty floor and ceiling in this way also has the consequence of ensuring all teams, including newer/weaker teams, have something to work on the whole time.
---
Frequently, "make your test longer" manifests as: pump out a ton of questions with discrete choices (multiple-choice, true-false, matching, etc.) and let the binomial distribution sort everyone out while you pick some arbitrary questions to be tiebreakers. Even if these questions are interesting (which they usually aren't), this is obviously unsustainable because you are lower bounded by the size of the field.
The better solution is to write questions that allow for a more continuous distribution of point values. This is a natural consequence of free response, short answer, fill in the blank, etc. questions. These questions don't necessarily take longer to answer than the aforementioned discrete questions, but they allow you to better differentiate between teams because now you have a lot more data to work with than just knowing which letter they bubbled.
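A rough way to put numbers on this, offered only as a sketch: assume two equally matched teams whose scores each follow a binomial distribution, and compare 100 one-point questions against the same 100 questions graded out of 4 points each. Treating partial credit as 400 independent one-point steps is a crude modelling assumption made here for illustration, as are the function name and the success rate of 0.5.

```python
from math import comb


def pair_tie_probability(points: int, p: float) -> float:
    """Probability that two independent teams, each scoring Binomial(points, p),
    finish on exactly the same total."""
    pmf = [comb(points, k) * p**k * (1 - p) ** (points - k) for k in range(points + 1)]
    # Two teams tie when both land on the same k, so sum the squared probabilities.
    return sum(q * q for q in pmf)


if __name__ == "__main__":
    all_or_nothing = pair_tie_probability(100, 0.5)   # 100 one-point questions
    partial_credit = pair_tie_probability(400, 0.5)   # assumed: 4 sub-points each
    print(f"tie chance, 100 discrete points: {all_or_nothing:.4f}")
    print(f"tie chance, 400 finer points:    {partial_credit:.4f}")
```

Under these assumptions, quadrupling the number of scoring steps roughly halves the chance that a given pair of teams ties, without adding a single question.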
---
There's nothing wrong with writing an exam nobody finishes if the questions in it are interesting and varied. There are always going to be people who don't finish the exam anyway. It's better to skew the distribution right than risk clumping the top teams together.
Given that most tournaments now seem to give a week for grading and scoring to happen, grading lengthy tests by hand in a timely fashion is far less of a barrier than it used to be.
---
tl;dr If your tests are good they will be good regardless of field size. If your tests are not good you should make them good, regardless of field size.
- These users thanked the author BrownieInMotion for the post:
- RiverWalker88 (Thu Dec 10, 2020 7:28 am)
-
- Exalted Member
- Posts: 454
- Joined: Thu Feb 21, 2019 2:05 pm
- Division: Grad
- Pronouns: He/Him/His
- Has thanked: 95 times
- Been thanked: 276 times
Re: Musings on Test Length
I think quality always beats quantity when writing a test (at least in the limited number I've written). Overall, however, I think that to separate good teams you need to test their memory retention, because it's gotten too easy for people to binder bash, especially in events like Astronomy. For that reason, I like long and painful tests which punish people who can't access their information as quickly.
Menomonie '21 UW-Platteville '25
Division D and proud. If you want a Geology tutor hmu.
-
- Member
- Posts: 129
- Joined: Fri Nov 30, 2018 10:40 am
- Division: Grad
- State: GA
- Has thanked: 21 times
- Been thanked: 78 times
Re: Musings on Test Length
BennyTheJett wrote: ↑Sat Dec 12, 2020 9:58 am
> I like long and painful tests which punish people who can't access their information as quickly.
I like how you think!
In all seriousness, I think that lengthy tests are good especially with online tournaments, as for some events there can't be a lab (Chem Lab, Circuit Lab, Forensics) and it serves as a form of protection against teams just using the internet to try to find answers, as well as allowing for there to be an appropriate spread of teams.
Boca Raton High School -> Georgia Tech
It's About Time writer/co-writer: Golden Gate, Georgia States
Ping Pong Parachute co-ES: MIT
Florida Game On C and Fermi Questions C champion!
and Circuit Lab too I guess
-
- Exalted Member
- Posts: 454
- Joined: Thu Feb 21, 2019 2:05 pm
- Division: Grad
- Pronouns: He/Him/His
- Has thanked: 95 times
- Been thanked: 276 times
Re: Musings on Test Length
jaggie34 wrote: ↑Tue Dec 15, 2020 2:11 pm
> BennyTheJett wrote: ↑Sat Dec 12, 2020 9:58 am
> > I like long and painful tests which punish people who can't access their information as quickly.
> I like how you think!
> In all seriousness, I think that lengthy tests are good especially with online tournaments, as for some events there can't be a lab (Chem Lab, Circuit Lab, Forensics) and it serves as a form of protection against teams just using the internet to try to find answers, as well as allowing for there to be an appropriate spread of teams.
Side note. NEVER EVER EVER just use Wikipedia for a source (for those looking to get into test writing). Try to find sources with information competitors might not have, as everything off Wikipedia will be in binders.
- These users thanked the author BennyTheJett for the post:
- MadCow2357 (Mon Dec 21, 2020 9:49 am)
Menomonie '21 UW-Platteville '25
Division D and proud. If you want a Geology tutor hmu.
-
- Member
- Posts: 571
- Joined: Thu Apr 26, 2018 6:40 pm
- Has thanked: 4 times
- Been thanked: 98 times
Re: Musings on Test Length
SilverBreeze wrote: ↑Sat Oct 31, 2020 2:17 pm
> Please, please don't artificially make tests more "difficult" by using confusing wording, small/blurry pictures, or obscure trivia that is barely relevant to the event. Students can tell.
> I really like questions that rely on facts learned from preparation to arrive at a new conclusion using critical thinking, but I know those types of questions are hard to write and harder to grade, especially with longer tests. Questions that test for understanding instead of recall tend to be more cheating-proof and more fun.
Agree that questions that require you to demonstrate understanding of the material, rather than just memorization, are desirable. They're also harder to put into a multiple choice format, and basically impossible to score reasonably in multiple choice: you do much better to have an extended-answer question worth several marks, with partial credit available. But that's harder to grade, and with everything running on Scilympiad, grading multiple choice became even easier, and grading anything else became harder.
(Plus I find mathy-type questions are harder to answer with typing than writing. YMMV.)
Agree with those who have said that once a test is long enough that the top teams can't complete it, adding extra length doesn't help differentiate between teams.
Suppose you have a test which is entirely multiple choice. Suppose that the difficulty of the questions is such that the top teams take 30s on average per question, meaning that top teams can complete 4 questions per minute (2 people working independently). That gives you 200 questions in a 50 minute test. Going beyond that doesn't help. You could introduce more questions by making them easier, so the good teams can bomb through them faster, but that makes it a reading speed test rather than a test of the subject matter.
So on this multiple choice test, you have 200 available points. If team scores are uniformly randomly distributed, 17 teams in your competition will give you a >50% chance of a score collision requiring a tiebreak (this is an example of the "birthday problem"). In reality, expected scores are clumped, so the probability of needing a tiebreak is higher.
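The 17-team figure can be checked with a short birthday-problem calculation. This is a sketch under the same simplification the post states (each team's score drawn independently and uniformly), with 200 possible score values assumed; the function names are made up for illustration.

```python
from math import prod


def collision_probability(teams: int, distinct_scores: int) -> float:
    """Chance that at least two teams share a score, assuming each score is
    drawn independently and uniformly from `distinct_scores` possible values."""
    if teams > distinct_scores:
        return 1.0  # pigeonhole: more teams than score values
    no_collision = prod(1 - k / distinct_scores for k in range(teams))
    return 1 - no_collision


def smallest_field_with_likely_tie(distinct_scores: int, threshold: float = 0.5) -> int:
    """Smallest number of teams whose collision probability exceeds `threshold`."""
    teams = 1
    while collision_probability(teams, distinct_scores) <= threshold:
        teams += 1
    return teams


if __name__ == "__main__":
    print(smallest_field_with_likely_tie(200))              # 17
    print(round(collision_probability(17, 200), 3))
```

With 16 teams the collision chance stays below one half; the 17th team is the one that tips it over.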
Have 200 teams (quite easy with remote Scilympiad competitions) and you expect 73 ties with uniformly random scores, and even more ties with realistic clumping. Sure - you can always break ties. Have a 200 point test where you break ties on all questions in reverse order, and all your ties are broken unless two teams produce exactly the same pattern of answers. Break ties after that on the time the team entered their last answer, and you've got no ties. But it's also true that your method of breaking ties is basically random. It's easy enough to pick out a few tie-break questions that are more difficult, understanding-testing questions, but nobody can realistically order all 200 questions in order of difficulty.
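One reading that reproduces the figure of about 73 is "number of teams minus the expected number of distinct scores". That interpretation is an assumption, since the post doesn't say how ties were counted; the sketch below also assumes uniform independent scores over 200 values and uses hypothetical function names.

```python
def expected_distinct_scores(teams: int, distinct_scores: int) -> float:
    """Expected number of score values hit by at least one team."""
    p_value_unused = (1 - 1 / distinct_scores) ** teams
    return distinct_scores * (1 - p_value_unused)


def expected_surplus_teams(teams: int, distinct_scores: int) -> float:
    """Teams beyond the first one on each score, i.e. placings needing a tiebreak."""
    return teams - expected_distinct_scores(teams, distinct_scores)


def expected_tied_pairs(teams: int, distinct_scores: int) -> float:
    """Expected number of team pairs with identical scores (a different count)."""
    return teams * (teams - 1) / 2 / distinct_scores


if __name__ == "__main__":
    print(round(expected_surplus_teams(200, 200), 1))
    print(expected_tied_pairs(200, 200))
```

Counting tied pairs instead gives a larger number (about 99.5 here), so which definition of "ties" is meant matters when comparing figures.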
How much does this really matter, though? Take BEARSO Div C, with ~200 teams, and look in the middle of the field. You've got multiple instances of teams separated by a point or two, and some instances of teams having the same point total. OK. Those teams basically did as well as each other. The ranking is pretty meaningless, but in basically any distribution, people that rank 95 out of 200 and 105 out of 200 are almost identical, because most things are pretty normal-looking.
You'd like the medals to have some meaning - you'd like the choice of whether someone finished first or fifth in an event not to be random - but if you've made the test hard enough, that'll probably happen naturally. Sure, there's a point where the field gets sufficiently large that a one-hour test can't differentiate between the top scorers. About 500 people get a 1600 on the SAT every year. That's OK.
(Although I might argue that it would be marginally better for SO to score ties as ties, rather than introduce random tiebreakers.)
- These users thanked the author knightmoves for the post:
- MadCow2357 (Mon Dec 21, 2020 9:50 am)
-
- Exalted Member
- Posts: 454
- Joined: Thu Feb 21, 2019 2:05 pm
- Division: Grad
- Pronouns: He/Him/His
- Has thanked: 95 times
- Been thanked: 276 times
Re: Musings on Test Length
knightmoves wrote: ↑Wed Dec 16, 2020 11:17 am
> [...] It's easy enough to pick out a few tie-break questions that are more difficult, understanding-testing questions, but nobody can realistically order all 200 questions in order of difficulty. [...]
I like the idea of having tiebreaker questions that are in depth and make you think more, separating the teams with better reasoning. I dislike when people do "sudden death from the back" or "first correct question wins the tie".
Menomonie '21 UW-Platteville '25
Division D and proud. If you want a Geology tutor hmu.
-
- Moderator
- Posts: 4315
- Joined: Sun Jan 26, 2014 12:48 pm
- Division: Grad
- State: GA
- Has thanked: 218 times
- Been thanked: 75 times
Re: Musings on Test Length
BennyTheJett wrote: ↑Wed Dec 16, 2020 11:48 am
> I like the idea of having tiebreaker questions that are in depth and make you think more, separating the teams with better reasoning. I dislike when people do "sudden death from the back" or "first correct question wins the tie".
The difficulty is that tiebreakers have to effectively tiebreak evenly matched teams across the entire spectrum, from top teams to the very bottom. Just choosing in-depth questions can easily backfire, since often they'll end up being unanswered by the teams toward the bottom of the stack.