HOAX?

Bob Stake1

 
      Checking the long list of plans sometime last winter, Lizanne asked me for a title for my closing remarks.  Having no idea, I pondered.  She scribbled on, finally saying, "Tell me, Bob, of all the great people you have worked with in CIRCE, whose ideas influenced you most?"  Now I had two ponderables.  She went on writing.  Finally, I said, "Maybe Hoke's."  Turning to me, she said, "Hoax?"  "Yeah, I guess, Hoke's?"  "With a question mark?"  "I suppose so." 

 From the bottom of my heart, I want to thank Lizanne for creating these two days, a marvelous deployment, and her staff, Connie, Karen, Trudy, Susan, Diane, and Beena, with great help from Elizabeth Easley, and hard work from Edith Cisneros, Marya Burke, Rita Davis and Terry Souchet.  I appreciate the generosity of the Jack Easley Endowment, the Daniel A. Alpert Fund, and the Bureau of Educational Research.  And I thank you all for coming, for speaking, for making it an honor for me, a delight for friends and family, a reunion for all. 
 

      To Teach.  My mother and father were teachers.  It only lasted a year for Grandpa Earl because some of the boys in his one-room schoolhouse were bigger than he.  My mother taught for 15 years, the early ones in a sod schoolhouse in western Nebraska.  Her grandfather had gone to the Genoa Indian Reservation in 1851 to bring agricultural methods to the Pawnee boys. 

 But I had no aspiration to teach.  To pontificate, yes.  To "show off," yes.  But the thought did not occur to me until I needed a post-baccalaureate year to attain my Navy ROTC commission.  I told Dean Henzlik I should use my non-Navy, available, upcoming, 28 credit hours to get a teaching certificate.  He said, "Why?"  I was stumped for an answer--but said, "I might get Navy duty with training responsibilities."  He said, "That's an answer," and arranged it.  So two semesters later, I was certified and commissioned at the same time, one day before I married Bernadine. 

 Bernadine soon was teaching in San Diego while I sailed Korean waters.  Back in San Diego, I was impressed by one of my eldest cousins, Richard Madden, a professor of education at San Diego State and co-author of the Stanford Achievement Tests.  Richard would spread his charts on a table in his study and show me test score trendlines of the children of Cherry Creek, Colorado, explaining how changes in the teaching of spelling had reconfigured the scores.  I marveled at Richard finding connections between teaching and testing.  Twenty years later, I hadn't found such connections myself, nor had my colleagues.  For his dissertation research,  I talked a bright, mature Aussie, Norman Bowman, into searching for present-day Richard Maddens, the practitioners so immersed in testing and curriculum that they could actually use the school's testing program diagnostically.  Like Diogenes, he found none. 

 And ten years later, for her dissertation research, I persuaded a bright, mature Brazilian, Penha Tres, to study the interactive knowledge of testing and curriculum improvement at the Office of the Illinois Superintendent of Public Instruction, in Springfield, to find the people who would understand both assessment and teaching, so that tests would be built partly to serve a diagnostic purpose.  And she found none.  And although the efforts to build the IGAPs were harmonious with those of curriculum professors here at the University of Illinois and at other leading teacher training institutions, there was no study of consequential validity--so that it could be said with assurance that improvements in teaching will be manifest in changes in test scores. 

 Paraphrasing Milton:  They also serve who leave the null hypothesis tenable.  Just a few hours ago, Michael Scriven noted that it is a sophisticated researcher who beams with pride having, with thoroughness and diligence, found nothing there. 
 

     Understanding Testing.  In 1954, my cousin would not let me enroll at San Diego State, saying there were better places to learn about testing.  I was accepted for graduate school here at the U of I but, after a scorching August visit in which I somehow failed to meet Tom Hastings and Lee Cronbach and found rent an unbelievable $125 a month, Bernadine and Jeff and I settled elsewhere. 

 A year later, a graduate assistant at the Educational Testing Service, I continued my fascination with test items.   It was a while before I realized these items were just another version of showing off.  I could devise analogy items that stumped even the cleverest of my friends. 

 As a political venture, I saw testing as "emancipatory."  Poor youngsters who could solve analogy items could share in the affluences of society.  It was another while before I realized that for every child enriched, many were further locked out of privilege, lured by the winsome foils of analogy items. 

 Let me assure you that these tests had respectable validity in the sense that, for a large heterogeneous group of youngsters, the scores correlated well with subsequent grades in school.  But as many of the critics of testing have noted, such test scores did not correlate well with success in later work, with practical ingenuity, aesthetic sensitivity, raising a family, being a good citizen, or becoming an effective teacher.   And many of the people who became good at these other things found life harder because their test scores suggested their aspirations were less worthy of support. 

 My studies at Princeton concentrated not on test development but on psychometrics, mathematical theories of measurement of human characteristics.  I wasn't very good at this stuff and it could be said that that was the reason I not only got out of testing, but became less reliant on quantitative measurement.  Who knows?  I returned to my alma mater, the University of Nebraska, to teach and do research.  It is hard to believe these days, but Charles Neidt had held a tenure track position open for me for three years while I was getting a doctorate. 

 There at Nebraska, I did my research on instruction.  I don't know why.  I found it good to design highly structured, experimental, standardized studies of teaching.  Somehow word got to Tom Hastings, whom I still had not met.  Tom needed someone to succeed Dave Krathwohl and Phil Runkel as his assistant for the Illinois Statewide Testing Program, headquartered across the alley from Newman Hall.  And what he wanted was someone who knew instruction and testing and might help make the Illinois tests more relevant to teaching.  Who he needed was Richard Madden, but he got me.  I arrived as he and Lee Cronbach were answering a US Office of Education invitation to create a National Educational Laboratory on campus, a CIRCE.  Tom wanted it to emphasize connections between teaching and testing, Lee wanted it to emphasize connections between curriculum development and evaluation.  I was so out of it that I doubt if a single paragraph I wrote got included in the proposal submitted by Lee, Tom, and Jack Easley. 

 One day as Tom and I were crossing a bike path on Wright Street, he asked me, "Now that you have learned to look both ways, what do you want to accomplish at Illinois?"  I said, "I never think that way."  It wasn't a premonition of going beyond goal-based evaluation.  It was more like a realization that success came easiest by setting low goals. 
 

     The Company.  At CIRCE, Tom and I tried to help Mike Atkin, Bill Creswell and a number of national curriculum project leaders with their evaluation obligations.  Jack somehow managed to get student responses analyzed and back in two weeks to Max Beberman's lesson writers, but that was still too slow.  And time and again, the longer evaluations showed no significant differences.  One answer was to do studies too small for inferential statistics.  That may have been the origin of case studies. 

 Or it may have been the day Lee got out of the car at the Union, saying, "What this field needs is a good social anthropologist."  It took me at least ten years to get an inkling of what he meant.  But I didn't wait that long to pay attention to what Lou Smith and Barry MacDonald and Ulf Lundgren and Mariann Amarel were doing. 

 Early days at CIRCE were heady times.  Jim Wardrop, Gene Glass and Doug Sjogren came aboard, then Ernie House, bringing Joe Steele, Tom Kerins, and Steve LaPan.  Tom Maguire and Peter Taylor were first in a stream of splendid graduate students, Dennis Gooler and Mary Ann Bunda, Terry Hopkins and Duncan McQuarrie.  And so many more, Jennifer McCreadie, Oli Proppe, Jim Pearsol, Judy Dawson. And on and on. 
 
 Off and on for many years, Gordon Hoke and Terry Denny hung out with us;  Claire Brown, Arden Grotelueschen, Jim Raths, Bob Linn.  Bernadine headed a three-year evaluation of the National Center for Sex Equity Education in Fort Lauderdale.  Jacquie Hill, Buddy Peshkin, Wayne Welch, Jim Sanders, Lou Smith, and Rob Walker helped with Case Studies in Science Education. 

 And a stream of head-turning visitors from far continents:  Ulf Lundgren, Barry MacDonald, Peter Fensham, Helen Simons, Arieh Lewy, David Hamilton, John Nesbitt, Royce Sadler, Marli Andre, Don Hogben. 

 All of them, locals and aliens, wonderful teachers.  From these, my personal mentors, I skimmed away over three thousand major ideas--acknowledging seven, six if you don't count Cronbach's curbside remark.  The reason I said "Hoke's" was that over a thousand of the ideas were from Gordon alone, which he in turn had stolen, but he always included the citation. 

 I didn't learn how to teach in my semesters at Nebraska.  I learned from you.  And I learned from my mistakes, at which you didn't laugh.  Well, Ernie did.  But most of you just smiled and said, "That's real nice." 

     Metaevaluation.  So I gradually learned that educational evaluation can't be done.  It can not be "done done."  It's an impossible dream.  If Ten is full-and-accurate determination of the value of an educational program, we sometimes get to Three, usually not past Two.   The RFP calls for Michelangelo, and we are finger-painting.  (I think I stole that line from you, Michael.) 

We differ among ourselves as to the meaning of the words, "to evaluate," and we advise folks to do a lot of different things in the name of "evaluation."  But speaking simply, it means to determine the quality of something.  Everybody evaluates all the while:  "You there are wearing your best shoes."  "That melon at lunch tasted so good."  "Although fictitious, this morning's accolades were so gracefully put!"  Or, the student in my 498 class this spring writing in her journal, "How can we learn to represent teaching quality when he won't tell us what it is?"  Each of us is a constant producer of evaluations. 

 But professional evaluation, where we move well beyond common sense and impression, where we reject simplistic indicators; professional evaluation, where we propose to combine the discipline of the connoisseur, the logic of the philosopher, the acuity of the ethnographer, and the moral sensitivity of the judge--there we are promising something we cannot do. 

 I look back over CIRCE's 34 years and wonder if we ever came close.  We have spun some provocative webs.  We have been temporarily familiar with a lot of teaching.  We have fashioned some penetrating issues, told some good stories, written some handsome reports, occasionally been useful.  But how close did we come to pinning down the merit and shortcoming of those programs? 

 I don't consider this an exercise in postmodern cynicism.  Oh, I have my poststructuralist streak.  Constructivism has its thrall, sometimes as tasty as ice cream.  But I walk down the stairs a modernist.  What I say today is, I believe, however deluded, a realistic metaevaluation of the field. 
 

    Analysis.  I am not put off because we find a thousand notions of what good teaching is.  Complex representations, we can handle.  I am put off because we cannot agree that the whole is greatly different from the sum of its parts.  And the embracing view of value is not nicely represented by a few exquisitely selected criteria.  We are especially weak when we focus on but a few of the many parts. 

 For diving:  the aggregate of perpendicular entry and small splash does not tell the quality of the dive.  For creative writing:  grammar, sequentiality, illustration, and closure do not tell the quality of the essay.  And description and judgment of antecedents, transactions, and outcomes do not encompass the quality of the innovative instruction. 

 We cannot take solace in the fact that most of the world doesn't want to know more about diving, or essays, or instruction.  There is a market for exit polling, for the Dow Jones, for the vignette, for the sound bite, for simplistic representations.  As Linda Mabry said this morning, all indicators are misrepresentations, and worse because they satisfy the curiosity for knowing the real meaning of the matter.  Even the best of our evaluations allow people to falsely presume that a complete evaluation has been done. 

 We should not be satisfied that quality of teaching is known by student ratings, or by student test scores, or by peer reviews, or by teacher of the year awards.  Teaching is a situationally responsive act, a role a hundred times more complicated than the best checklist or set of standards.  Its meaning is constructed by the folks involved every bit as much as the meaning of mathematics is constructed by children.  The value is embedded in the situation, only in small part accessible to evaluators, supervisors, or the teachers themselves.  Every child is shaped in part by teachers, for good or not, and most of the good they do, and most of the ill they do, is God's truth alone. 

 Does that mean it's been a waste?  Of course not.  We have done--not the best we might have--but many things worthy of pride.  We know much more now than we did in the 60s.  Thanks to Michael especially, and to many of you toilers here today, we have real help to offer program directors and constituents, help both toward the determination of value and the facilitation of self-study.  And while preserving the connection Ralph Tyler made between the curriculum and evaluation, as Lee and others have said so persuasively these two days, we have brought democracy to the center of our conversation. 
 

     Inservice.  What did I learn?   If I were to name the biggest thing I have learned in this time it is--it's what Ernie said this morning in different words--that the program and its value are one and the same, that the meaning of an evaluand and its quality are one thing, not two. 

 When I wrote the "countenance paper," I put description on one side and judgments on the other.  But it was a mistake to imply that descriptive data and judgment data should be pulled apart.  As we observe teaching, learning, the politics and the culture of education, we simultaneously see their merit and shortcoming.  We can identify criteria and get ratings or scores, but these analytic calculations draw us away, I think, from understanding the quality of the program. 

 Our minds will analyze; analysis is a fixture, often useful for getting us to attend to understudied parts, but analysis is construction as much as dissection.  Values analyzed can be less a refinement, more a replacement.  I continue to endorse "responsive evaluation" for its holistic mindset, responding to the activity, the complexity, the situationality and the quality of education with the fullest interpretation 180 pages will allow.  But no one approach is good enough.  As Oli Proppe said in his dissertation proposal, a dialectic among several mindsets is essential to good evaluation. 

 When we studied science education in the nation's schools in the 70s, we were up against a federal formula saying that quality is the difference between where we are and where we ought to be.  But quality is not a discrepancy.  It is an inherent, evolving compounding of the evaluand. 

 As we have examined the quality of professional development at the Chicago Teachers Academy, we have found the merit of teaching and learning captured neither by Bill Bennett's "worst schools" soundbite, nor Bill Clinton's praise, nor Paul Vallas' probation list. 

 In education reports of all kinds, the executive summary is a fiction.  Reality is at least "touched" by the description of teaching integrity and wasted opportunity in the classroom. 
 
 Beauty is in the eye of the beholder, but inseparable from the flower. 

 For me, it's been a great ride. 
 

 And Yet.  Just as my analogy items made life harder for those who scored low, formal evaluation, as we have practiced it, has made Education less effective.  If we put the power beams of consequential validity down on Evaluation, we see that we have failed to make it clear that almost everyone has too narrow a view of teaching and learning.   And that narrowness distorts the judging of our youngsters, our schools, our society and ourselves. 

 At the top of the list of deceits we have failed to expose are those of standardized testing.  We have failed to show that the best testing has regularly not been an indication of what students can do, nor of the quality of the educational system, nor of what the teachers or the society should do next. 

 According to Gallup Polls in the 60s, the populace had high confidence in our schools--now, grave doubts.  In some ways, the schools are not as good as they were; in a few ways, they are better.  But the image of the schools has changed, partly because the schools won't adapt to an evolving society, partly because many people don't want them to change as much as they do.  An awful lot of people feel they know how to run the schools better.  And a good part of the false confidence is at our doorstep.  We most responsible for the formal evaluation of education have not provided better representation of teaching quality than standardized test scores. 
 

     Homework.  So I ask your help.  My colleagues are passing about the room handing out forms (see attachment here).  Here is what we are going to do.  We are going to do a study to help legitimate a fact that almost everyone in this room knows: that you cannot use standardized student achievement test scores to determine quality of teaching. 

 Each of you--should you accept the mission--will approach the principal of an elementary school, and, after pledging confidentiality and gaining rapport, ask him or her to identify one situation in which a quite good teacher has preceded or succeeded a teacher clearly not so good.  That is, identify a classroom in which the teacher one year and the teacher the next year were of quite different teaching capability.  Then you need to get the principal to release to you the test scores for that one room for the two years.  With assurances that the assignment of students to that room has not changed,  we can expect there to be a random plus and minus difference in means across the two years.   Some of you will want to make several comparisons. 

 By finding no grounds for rejecting the null hypothesis, we will have a handsome citation that either student testing does not indicate teaching quality, or that principals do not know good teachers from bad. Obviously we will have to deal with several complications here, but it is time we made the citation. 
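
 For readers who want the arithmetic of that comparison spelled out, here is a minimal sketch, not part of the talk itself, treating the one room's scores from the two years as independent samples (different cohorts of children) and running Welch's two-sample t-test.  The score lists and the .05 threshold are illustrative assumptions only.

    # Hypothetical sketch of the two-year, one-classroom comparison described above.
    # Assumes the two years' scores are independent samples from comparable cohorts.
    from scipy import stats

    scores_stronger_teacher_year = [48, 52, 55, 47, 60, 51, 49, 58, 53, 50]
    scores_weaker_teacher_year   = [50, 46, 57, 52, 49, 55, 48, 54, 51, 47]

    # Welch's t-test does not assume equal variances across the two years.
    t_stat, p_value = stats.ttest_ind(
        scores_stronger_teacher_year,
        scores_weaker_teacher_year,
        equal_var=False,
    )

    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    if p_value > 0.05:
        print("No grounds for rejecting the null hypothesis of equal means.")
    else:
        print("Means differ; check whether student assignment to the room changed.")

 A pile of such comparisons, most of them leaving the null hypothesis tenable, is the citation the homework asks for.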

 I am serious.  I have only a few research projects still to do, maybe one.  Enough of vision; it is time for damage control.  The aim is clear, to help improve popular conceptualization of school quality.  I really would appreciate your help.  This is no hoax. 
 

     Last word.  We are not gathered here for commencement.  Things are winding down for this teacher.   The archivists will soon be by.  They will look in my files and on my shelves, and find precious little to preserve.  But it is not they who evaluate a career.  What matters is in the eyes, the minds, and hearts of those I see before me today.  In the words of Jennifer Green yesterday, "Let's, you and I, 'toil on.'" 

 Thank you. 

 Note: 

 _____________ 
1.  This is my presentation to conclude a splendid symposium honoring my career at the time of my formal retirement on May 9, 1998. 


 