<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

<channel>
	<title>Shanker Blog</title>
	
	<link>http://shankerblog.org</link>
	<description>THE VOICE OF THE ALBERT SHANKER INSTITUTE</description>
	<lastBuildDate>Thu, 23 Feb 2012 16:41:54 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/ShankerBlog" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="shankerblog" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">ShankerBlog</feedburner:emailServiceId><feedburner:feedburnerHostname xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Do Value-Added Models “Control For Poverty?”</title>
		<link>http://shankerblog.org/?p=5176</link>
		<comments>http://shankerblog.org/?p=5176#comments</comments>
		<pubDate>Thu, 23 Feb 2012 16:08:38 +0000</pubDate>
		<dc:creator>Matthew Di Carlo</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[education research]]></category>
		<category><![CDATA[florida]]></category>
		<category><![CDATA[poverty]]></category>
		<category><![CDATA[teacher effects]]></category>
		<category><![CDATA[teacher quality]]></category>
		<category><![CDATA[value-added]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=5176</guid>
		<description><![CDATA[There is recent controversy over the fact that Florida's new test-based model of teacher effectiveness doesn't control directly for student poverty. This is not only a mischaracterization, but, more importantly, it also ignores the bigger issues surrounding these measures.]]></description>
			<content:encoded><![CDATA[<p>There is some <a href="http://stateimpact.npr.org/florida/2012/02/17/why-poverty-is-not-included-in-the-mathematical-equation-for-teacher-merit-pay/">controversy</a> over the fact that Florida’s recently-announced value-added model (one of a class often called “covariate adjustment models”), which will be used to determine merit pay bonuses and other high-stakes decisions, doesn’t include a direct measure of poverty.</p>
<p>Personally, I support adding a direct income proxy to these models, if for no other reason than to avoid this type of debate (and to facilitate the disaggregation of results for <a href="http://shankerblog.org/?p=5056">instructional purposes</a>). It does bear pointing out, however, that the measure that’s almost always used as a proxy for income/poverty – students’ eligibility for free/reduced-price lunch – is terrible as a poverty (or income) gauge. It tells you only whether a student’s family has earnings below (or above) a given threshold (usually 185 percent of the poverty line), and this masks most of the variation among both eligible and non-eligible students. For example, families with incomes of $5,000 and $20,000 might both be coded as eligible, while families earning $40,000 and $400,000 are both coded as not eligible. A lot of hugely important information gets ignored this way, especially when the vast majority of students are (or are not) eligible, as is the case in many schools and districts.</p>
<p>That said, it’s not quite accurate to assert that Florida and similar models “don’t control for poverty.” The model may not include a direct income measure, but it does control for prior achievement (a student’s test score in the previous year[s]). And a student’s test score is probably a better proxy for income than whether or not they’re eligible for free/reduced-price lunch.</p>
<p>Even more importantly, however, the key issue about bias is not whether the models “control for poverty,” but rather whether they control for the range of factors – school and non-school – that are known to affect student test score growth, independent of teachers’ performance. Income is only one part of this issue, which is relevant to all teachers, regardless of the characteristics of the students that they teach.<span id="more-5176"></span></p>
<p>First, let’s take a quick look at the issue of “controlling for poverty.”</p>
<p>An important, albeit somewhat counterintuitive fact is that a statistical model doesn’t have to control for something <em>directly</em> in order to partially account for its influence, if it does control for a different variable that is highly correlated. If this is confusing, try a simplified example: Let’s say I want to see whether a person’s age influences their political views. In doing so, I will obviously need to control for a bunch of factors that might also be associated with those views, such as gender. But let’s also say that I don’t have data on one very important factor – earnings.</p>
<p>I do, however, include a variable measuring education, which is highly correlated with earnings. It’s not a perfect association, but if I control for education (which, in this case, I should do anyway), I will be able to account for at least some portion of the variation in political views that is associated with earnings. There will inevitably be some error – not everyone with a lot of education makes a lot of money, and vice-versa, but, over a large enough sample of individuals, much of this error will be cancelled out. In that sense, even though I’m not controlling <em>directly</em> for earnings, I can still pick up at least some of its effects.</p>
<p>The same thing goes for value-added models. Even if Florida doesn’t control for free/reduced-price lunch eligibility (in their case, they can do so, but have chosen not to), that doesn’t mean the models simply <em>ignore</em> income (or poverty). They do control for students’ prior achievement – i.e., they predict student testing gains as a function of a bunch of variables, the most important of which is a student’s actual score the previous year.</p>
<p>And, like education and earnings, there is a very strong correlation between family income and students’ absolute test scores, which means that controlling for the latter will “pick up” a lot of the variation in the former, especially over large samples of students.</p>
<p>This does not, however, mean that Florida shouldn’t control for lunch program eligibility directly. One could make a strong argument that the models should include everything possible that is associated with testing performance, especially when there are no costs in doing so. On the other hand, there is an empirical question here – do the estimates change a great deal when the variable is included? The evidence on this score depends on the <a href="http://www.caldercenter.org/UploadedPDF/1001508-Measurement-of-Teacher-Productivity.pdf">type of model used</a>, the years of <a href="http://www.dartmouth.edu/~dstaiger/Papers/KaneStaiger_2002.pdf">data available</a>, the <a href="http://www.wcer.wisc.edu/news/events/VAM%20Conference%20Final%20Papers/MeasuringEffectSizes_BoydEtAl.pdf">properties of the test</a> and other factors. Sometimes <a href="http://gse.berkeley.edu/admin/events/docs/epaa.pdf">results are different</a> overall, sometimes <a href="http://web.missouri.edu/~podgurskym/Econ_4345/syl_articles/ballou_sanders_value_added_JEBS.pdf">they’re quite similar</a>, though in both cases, at least some individual teachers’ estimates are likely affected.*</p>
<p>I’m not familiar with how Floridians made their decision to exclude the lunch measure, but I do hope they will test regularly whether their results are different when it’s included.</p>
<p>In any case, the inclusion of prior test scores (and other variables) does account for income and poverty, at least to some extent, and the free/reduced-price lunch variable is so limited that it doesn’t always add much to the power of the models. This means that it is misleading to say that the “models don’t control for poverty.” **</p>
<p>It’s also an oversimplification to the point of being a distraction.</p>
<p>The proper question about bias is broader: Do the models adequately account for all the relevant factors that are outside of teachers’ control?</p>
<p>This is a bigger issue, and it pertains to teachers of high- and low-poverty students in all schools and districts, urban and rural, large and small. Now, it’s certainly true that many of the conditions that influence performance, such as parental involvement, oral language development, early childhood education, family stress, etc., are associated with income, but the <a href="http://shankerblog.org/?p=4781">relationship is imperfect</a>. And many other important factors are only weakly related, or unrelated, to income (especially given the limitations of the lunch program variable). Child development is <a href="http://faculty.smu.edu/millimet/classes/eco7321/papers/todd%20wolpin.pdf">cumulative and multifaceted</a>.</p>
<p>So, the answer to this more central question – whether a growth model accounts for non-teacher factors – is inherently a matter of degree. When using properly-interpreted estimates from the best models with multiple years of data (not the case in many places using these estimates for decisions), it’s fair to say that a large proportion of the non-teacher-based variation in performance can be accounted for.</p>
<p>There will, however, always be bias, sometimes substantial, affecting many individual teachers, as there would be with any performance measurement, including classroom observations. Whether or not the bias is “tolerable” depends on one’s point of view, as well as how the estimates are used (the latter is especially important among cautious value-added supporters like myself). Furthermore, as I’ve argued <a href="http://shankerblog.org/?p=1383">many times</a>, the bigger problem in many cases, one that can be partially addressed but is being largely ignored, is <a href="http://shankerblog.org/?p=353">random error</a>.</p>
<p>But that’s a separate discussion. For now, the main point is that the controversy over the role of “poverty” in education has assumed unqualified importance in the debate over value-added. The debate is broader and more complicated than that. Reducing it to a poverty argument is likely to be unproductive in the short and long run. It oversimplifies the potential problem of systematic bias, and also ends up ignoring the critical issues – implementation, random error, model specification, data quality, etc. – that can make all the difference.</p>
<p>- Matt Di Carlo</p>
<p>*****</p>
<p>* A related but somewhat technical issue with controlling for student demographic characteristics, which is sometimes used as a reason for excluding them, is the possibility that less effective teachers are <a href="http://www.urban.org/uploadedpdf/1001469-calder-working-paper-52.pdf">concentrated in higher-poverty schools</a>. If that’s the case, then, put simply, the models will “mistake” this for poverty effects (or those of other characteristics). I should also mention that there is a distinction here between controlling for <em>individual</em> students’ income/poverty, and classroom- and school-level poverty (e.g., the percent of students in a class or school who are eligible for free/reduced-price lunch). The latter measures help address factors such as peer effects. Whether or not classroom- or school-level characteristics have a substantial impact on results also varies by methods and data availability.</p>
<p>** In case you’re missing the irony here, consider two facts: first, opponents of value-added often point out that the models do not account for student poverty, which is a strong predictor of testing performance; second, the models admit to the relationship between scores and poverty by controlling for the former to account for the latter.</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=5176</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Public Schools Create Citizens In A Democratic Society</title>
		<link>http://shankerblog.org/?p=5172</link>
		<comments>http://shankerblog.org/?p=5172#comments</comments>
		<pubDate>Thu, 23 Feb 2012 00:31:06 +0000</pubDate>
		<dc:creator>Jeffrey Mirel</dc:creator>
				<category><![CDATA[Democracy]]></category>
		<category><![CDATA[Education]]></category>
		<category><![CDATA[Guest Posts]]></category>
		<category><![CDATA[immigration]]></category>
		<category><![CDATA[jeffrey mirel]]></category>
		<category><![CDATA[middle east]]></category>
		<category><![CDATA[u.s. history]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=5172</guid>
		<description><![CDATA[University of Michigan Professor Jeffrey Mirel argues that strong public schools are a primary means by which citizens learn trust, tolerance and faith in democracy, and that this is evident in the history of our nation.]]></description>
			<content:encoded><![CDATA[<p><em>Our guest author today is <a href="http://www.soe.umich.edu/people/profile/jeffrey_mirel/">Jeffrey Mirel</a>, Professor of Education and History at the University of Michigan.  His book, </em>Patriotic Pluralism: Americanization Education and European Immigrants<em>, published in 2010 by Harvard University Press, is available in bookstores and <a href="http://www.amazon.com/Patriotic-Pluralism-Americanization-Education-Immigrants/dp/0674046382/ref=sr_1_1?ie=UTF8&amp;qid=1329955891&amp;sr=8-1">online</a>. </em></p>
<p>How do you get people who hate each other to learn to resolve their differences democratically? How do you get them to believe in ballots not bullets?</p>
<p>What if the answer is “public schools” and the evidence for it is in our own history during the first half of the twentieth century?</p>
<p>In the years spanning about 1890-1930, two institutions—public schools and the foreign language press—helped generate this trust among the massive wave of eastern and southern European immigrants who came to the U.S. during that time. This is not a traditional “melting pot” story but rather an examination of a dynamic educational process.<span id="more-5172"></span></p>
<p>The majority of these immigrants were dramatically different from the native born Americans they encountered here.  Most immigrants knew no English, worshipped at synagogues or Roman Catholic or Eastern Orthodox churches, and had little knowledge of democracy.  Many native born Americans viewed this “invasion of immigrants” as akin to the onslaught of the barbarians who destroyed Rome.  Indeed, some argued that these newcomers were <em>genetically</em> incapable of becoming Americans.</p>
<p>The “us and them” attitudes of many Americans toward immigrants could have set off a fissure among groups, which might have rivaled the factions that have fought for generations in the Middle East. Take Syria. Thomas Friedman (in a February 8, 2012, <em>New York Times</em> <a href="http://www.nytimes.com/2012/02/08/opinion/friedman-freedom-at-4-below.html?_r=1&amp;ref=thomaslfriedman&amp;gwh=7BBC56DED79AB24315F7A3DCD02E90B5">article</a>) notes that Syria’s fractious religious and ethnic groups lack the political trust that is vital for creating democratic societies.  America, Friedman argues, provides an example of how diverse groups, many of which have deep resentments and anger about one another, still can work together politically for the good of the country.  Friedman has been making that argument for many years and he is correct.  And it may have been leaders of the public schools who engineered the mechanism that nurtured trust and tolerance among immigrants and native born Americans.</p>
<p>Public school leaders, particularly those in America’s great industrial cities, believed that their institution could transform the immigrants and their children into committed and loyal American citizens.  In the K-12 and adult education programs, these educators promoted three main things: speaking English; learning American history, particularly about such American heroes as Washington and Lincoln; and gaining knowledge about democracy as well as the rights and responsibilities of American citizenship.</p>
<p>Many historians have described the interaction of schools and immigrants as something like an abusive relationship, in which immigrants and their children were bludgeoned into assimilation.  But my research on the foreign language newspapers in Chicago and Cleveland, documented in my book <em>Patriotic Pluralism,</em> tells a different story, drawing on the voices of the immigrants themselves. The editors of virtually all the major foreign language newspapers serving such groups as Czechs, Greeks, Hungarians, Italians, Jews, and Poles in these two cities endorsed Americanization but <em>on their terms</em>.  They firmly supported key elements of Americanization but always with an ethnic twist.</p>
<p>For example, they all strongly supported adults and children learning English but they were adamant about teaching their American-born children to speak and read their ancestral tongue after school and/or in weekend schools.  The newspapers urged their readers to learn American history but they routinely supplemented the grand, national narrative with stories about immigrants who played important parts in that history. Czechs, for example, pointed proudly to their defense of the Union in the Civil War;  Greeks described American democracy as the apotheosis of Greek culture; and, on the 4th of July, Poles celebrated Washington and Jefferson but also two Polish heroes of the American Revolution, Thaddeus Kosciuszko and Casimir Pulaski.</p>
<p>Lastly, these papers encouraged immigrants to become citizens, which would help them individually and increase the political power of their group.  These stances were not examples of cultural separatism but rather evidence of how these immigrants and their children became patriotic Americans <em>and</em> proud ethnics simultaneously. Several newspapers simply declared that the “old country” was their mother and America was their spouse.  They loved them both.  They were patriotic pluralists.</p>
<p>Can this kind of both/and consciousness develop amid the violent religious and ethnic struggles that we see in such places as Syria? I am less sanguine than Friedman about that possibility, but I agree with him that the only way something like patriotic pluralism can emerge in the Middle East is for us to support people there who, as he put it, “deeply long to be citizens.”</p>
<p>- Jeffrey Mirel</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=5172</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Interpreting Achievement Gaps In New Jersey And Beyond</title>
		<link>http://shankerblog.org/?p=5102</link>
		<comments>http://shankerblog.org/?p=5102#comments</comments>
		<pubDate>Tue, 21 Feb 2012 14:34:03 +0000</pubDate>
		<dc:creator>Matthew Di Carlo</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[achievement gap]]></category>
		<category><![CDATA[new jersey]]></category>
		<category><![CDATA[school effects]]></category>
		<category><![CDATA[standardized tests]]></category>
		<category><![CDATA[testing data]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=5102</guid>
		<description><![CDATA[A recent statement by the New Jersey Department of Education is indicative of the caution required when interpreting the achievement gap, particularly over time, as a measure of student performance.]]></description>
			<content:encoded><![CDATA[<p>A recent <a href="http://www.nj.gov/education/news/2012/0209gap.htm">statement</a> by the New Jersey Department of Education (NJDOE) attempts to provide an empirical justification for that state’s focus on the achievement gap – the difference in testing performance between subgroups, usually defined in terms of race or income.</p>
<p>Achievement gaps, which receive a great deal of public <a href="http://www.nationalaffairs.com/publications/detail/our-achievement-gap-mania">attention</a>, are very useful in that they demonstrate the differences between student subgroups at any given point in time. This is significant, policy-relevant information, as it tells us something about the inequality of educational outcomes between the groups, which does not come through when looking at overall average scores.</p>
<p>Although paying attention to achievement gaps is an important priority, the NJDOE statement on the issue actually speaks directly to the fact, which is well-established and quite obvious, that one must exercise caution when interpreting these gaps, particularly over time, as measures of student performance.<span id="more-5102"></span></p>
<p>The central purpose of the statement is to argue that, while New Jersey&#8217;s overall test scores are among the best in the nation, achievement gaps – and their closure – should be the big <a href="http://www.nj.gov/education/news/2011/1208gap.htm">focus</a> of the state’s education policy, and the primary gauge of its success or failure. In an attempt to justify this view, and address directly the question of why the gap matters, the NJDOE statement asserts:</p>
<blockquote><p>[The achievement gap] matters to the nearly 40% of our students can&#8217;t read at grade level in 3rd grade &#8211; an indicator closely tied to future success in school. It matters to the thousands of students that drop out of high school or even before high school each year. And it matters to a high school dropout that faces a radically different future than a college graduate.</p></blockquote>
<p>This is more or less a non-sequitur. The cited statistics – 40 percent of third graders not reading at grade level, and a high dropout rate – are troubling and important, but they’re not achievement gaps. In fact, the achievement gap as typically defined might actually conceal these outcomes.</p>
<p>The existence of a gap only tells you that there are differences in outcomes (e.g., scores) between two groups, which are almost always defined in terms of income or race. States and districts could have a large achievement gap, but still contain a relatively high proportion of students reading at grade level. For instance, a relatively affluent district could see virtually all third graders reading at grade level, but still have a significant achievement gap – with one group (e.g., high-income students) performing much higher than the other. Conversely, one could imagine a struggling district with a much smaller gap, but where virtually all students still scored below grade level.</p>
<p>If the NJDOE wants to make the (perfectly compelling) case that something should be done to help the 40 percent of third graders who it says cannot read at grade level, then those are the statistics they should cite. But it&#8217;s a little perplexing to use these figures, by themselves, as a justification for their focus on race- and income-based achievement gaps, which are a different type of measure. The better argument would have been to simply say that the state’s overall scores are quite high, but that this masks the fact that many students are still struggling. Differences in scores between subgroups are usually an underlying component of overall performance, but the latter might be obscured by the former, and vice-versa. It&#8217;s critical to examine both.*</p>
<p>But the big issues arise when achievement gaps are viewed over time. This is the primary focus of the NJDOE statement (as well as NJ&#8217;s overall 2011 <a href="http://www.nj.gov/governor/news/news/552011/pdf/20110407b_education.pdf">education reform agenda</a>). The state presents a bunch of graphs illustrating the trend in gaps between 2005 and 2011. The accompanying text asserts that race- and income-based gaps as measured by the state’s tests have been persistent since 2005 (<a href="http://nces.ed.gov/nationsreportcard/">NAEP</a> results support this characterization – for instance, the difference between students eligible and not eligible for free/reduced-price lunch in 2011 is not statistically different from that in 2005 in any of the four main assessments).**</p>
<p>Differences in performance between student subgroups are important, but the “narrowing of the achievement gap,” as a policy goal, also entails serious, well-known <em>measurement</em> problems, which can lead to misinterpretations. For example, as most people realize, the gap between two groups can narrow even if both <em>decline</em> in performance, so long as the higher-performing group decreases more rapidly. Similarly, an achievement gap can remain constant – and suggest policy failure – if the two groups attain strong, but similar, rates of improvement.</p>
<p>Put differently, trends in achievement gaps, by themselves, frequently hide as much as they reveal about the performance of the student subgroups they compare. These issues can only be addressed by, at the very least, decomposing the gaps and looking at each group separately. And that’s precisely the case in NJ.</p>
<p>The simple table below compares the change (between 2005 and 2011) in average NAEP scale scores for NJ students who are eligible for free/reduced-price lunch (lower-income) versus those who are not eligible (higher-income). I want to quickly note that these data are cross-sectional, and might therefore conceal differences in the cohorts of students taking the test, even when broken down by subgroups.***</p>
<p style="text-align: center;"><a href="http://shankerblog.org/wp-content/uploads/2012/02/njgapchange1.png"><img class="size-full wp-image-5114 aligncenter" title="njgapchange" src="http://shankerblog.org/wp-content/uploads/2012/02/njgapchange1.png" alt="" width="450" height="194" /></a></p>
<div>This table shows that, in three out of four NAEP tests, <em>both lower- and higher-income cohorts’ scores have increased substantially,</em> at roughly similar rates. In fourth grade math, students eligible for free/reduced-price lunch scored six points higher in 2011 compared with 2005, the equivalent of roughly half a &#8220;year of learning,&#8221; compared with a similar, statistically discernible five-point increase among non-eligible students. The results for eighth grade math and fourth grade reading are more noteworthy – on both tests, eligible students in NJ scored 12 points higher in 2011 than in 2005, while the 2011 cohorts of non-eligible students were higher by roughly similar margins.</div>
<p>In other words, achievement gaps in NJ didn’t narrow during these years because both the eligible and non-eligible cohorts scored higher in 2011 versus 2005. Viewed in isolation, the persistence of the resulting gaps might seem like a policy failure. But, while nobody can be satisfied with these differences and addressing them must be a focus going forward, the stability of the gaps actually masks notable success among both groups of students (at least to the degree that these changes reflect “real” progress rather than compositional changes).</p>
<p>Only in eighth grade reading was there a discrepancy between subgroups – the score for the 2011 cohort of students eligible for free/reduced-price lunch is statistically indistinguishable from that of the 2005 cohort. This means that, by social science conventions, we cannot dismiss the possibility that the change for this group was really just random noise. So, to the degree that there was a widening of the NJ achievement gap in eighth grade reading between 2005 and 2011 (and, as stated above, it also wasn’t large enough to be statistically significant), it’s because there was a discernible change for one subgroup but not the other. This exception is something worth looking into, and it is only revealed when both groups are viewed in terms of their absolute, not relative, scores.</p>
<p>Similarly, if one looks exclusively at achievement gap trends in a simplistic manner, the substantial increases between cohorts in seven out of eight subgroup/exam combinations would be ignored, as would the fact (not shown in the table) that both eligible and non-eligible students score significantly higher than their counterparts nationally on all four assessments.</p>
<p>Roughly identical results are obtained for the subgroup changes if the achievement gap is defined in terms of race – there were equally large increases among both white and African-American cohorts between 2005 and 2011 in all four tests except eighth grade reading, where the change between African-American student cohorts was positive but not statistically significant.</p>
<p>(One important note about the interpretation of these data: The cohort changes among seven of the eight groups shown in the table [or the gaps in any given year], assuming that some portion of them is &#8220;real progress,&#8221; should not necessarily be entirely chalked up to the success of NJ schools <em>per se</em>. Doing so would represent the rather common error of <a href="http://shankerblog.org/?p=4980">conflating student and school performance</a> – that is, assuming that students’ testing results [and changes therein] are entirely due to schools’ performance, though it’s well-established <a href="http://www.mathematica-mpr.com/publications/pdfs/Education/False_Perf.pdf">empirically</a> that this is not the case. Some of the change is school-related, while some of it is a function of non-school factors [and/or sampling variation]. Without multivariate analysis using longitudinal data, it’s very difficult to tease out the proportion attributable to instructional quality.)</p>
<p>Nevertheless, the simple data above do suggest that an overinterpretation of the achievement gap as an educational measure, without, at the very least, attention to the performance of constituent subgroups, can be problematic. Yes, in any given year, the differences between groups can serve as a useful gauge of inequality in outcomes, and, without question, we should endeavor to narrow these gaps going forward, while hopefully also boosting the achievement of all groups.</p>
<p>But it’s important to remember that the gaps by themselves, especially viewed over time, often mask as much important information as they reveal about the performance of each group, within and between states and districts, as well as the ways in which the actual quality of schools interacts with them. Their significance can only be judged in context. States and districts must interpret gaps in a nuanced, multidimensional manner, lest they risk making policy decisions that could actually impede progress among the very students they most wish to support.</p>
<p>- Matt Di Carlo</p>
<p>*****</p>
<p>* It’s also worth noting that the achievement gap as defined above – the difference in scores between students eligible and not eligible for free/reduced-price lunch &#8211; is not statistically different from the U.S. public school student average in three out of four NAEP tests (with the exception being eighth grade reading, where the NJ gap is moderately larger).</p>
<p>** Most of the data presented in the NJDOE statement are achievement gaps on the state’s tests, as defined in terms of <em>proficiency rates</em> – that is, the difference in the overall proficiency rate between subgroups, such as students who are and are not eligible for free/reduced-price lunch. Using proficiency rates in serious policy analysis is almost always poor practice – they only tell you how many students are above or below a particular (and sometimes arbitrary) level of testing performance. However, measuring achievement gaps using these rates, especially over time, is almost certain to be misleading– an odd decision that one would not expect of a large state education agency. In this post, I use actual scale scores.</p>
<p>*** Since most achievement gaps compare two groups of students, they often mask huge underlying variation. For example, the comparison of students who are eligible versus those not eligible for free- and reduced-price lunches completely ignores the fact that this poverty measure only looks at students below a certain threshold, thus concealing the fact that <a href="http://schoolfinance101.wordpress.com/2011/05/05/why-comparing-naep-poverty-achievement-gaps-across-states-doesnt-work/">some students below that line are much more impoverished than others</a>, making comparisons between states and districts (and over time) extremely difficult (making things worse, <a href="http://shankerblog.org/?p=4781">income is a very limited measure</a> of student background). On a related note, one infrequently-used but potentially informative conceptualization of the achievement gap is the difference between high- and low-performing students (e.g., comparisons of scores between percentiles).</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=5102</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>If Newspapers Are Going To Publish Teachers’ Value-Added Scores, They Need To Publish Error Margins Too</title>
		<link>http://shankerblog.org/?p=5087</link>
		<comments>http://shankerblog.org/?p=5087#comments</comments>
		<pubDate>Thu, 16 Feb 2012 16:19:17 +0000</pubDate>
		<dc:creator>Matthew Di Carlo</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[new york city]]></category>
		<category><![CDATA[teacher quality]]></category>
		<category><![CDATA[value-added]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=5087</guid>
		<description><![CDATA[If New York City newspapers are, thanks to a recent court ruling, going to publish the individual value-added scores of the city's teachers, they must, at the very least, also publish the large margins of error that accompany these estimates.]]></description>
			<content:encoded><![CDATA[<p>It seems as though New York City newspapers are going to <a href="http://blogs.wsj.com/metropolis/2012/02/14/teachers-ratings-to-go-public-after-union-loses-appeal/?mod=WSJBlog&amp;mod=WSJ_NY_NY_Blog">receive the value-added scores</a> of the city’s public school teachers, and publish them in an online database, as <a href="http://shankerblog.org/?p=1928">was the case</a> in Los Angeles.*</p>
<p>In my opinion, the publication will not only serve no useful purpose educationally, but it is also a grossly unfair infringement on the privacy of teachers. I have also <a href="http://shankerblog.org/?p=1093">argued previously</a> that putting the estimates online may serve to bias future results by exacerbating the non-random assignment of students to teachers (parents requesting [or not requesting] specific teachers based on published ratings), though it&#8217;s worth noting that the city is now using a different model.</p>
<p>That said, I don’t think there’s any way to avoid publication, given that about a dozen newspapers will receive the data, and it’s unlikely that every one of them will decline to do so. So, in addition to expressing my firm opposition, I would offer what I consider to be an absolutely necessary suggestion: If newspapers are going to publish the estimates, they need to publish the error margins too.<span id="more-5087"></span></p>
<p>Value-added and other growth model scores are statistical estimates, and must be interpreted as such. Imagine that a political poll found that a politician’s approval rate was 40 percent, but, due to an unusually small sample of respondents, the error margin on this estimate was plus or minus 20 percentage points. Based on these results, the approval rate might actually be abysmal (20 percent), or it might be pretty good (60 percent). Should a newspaper publish the 40 percent result without mentioning that level of imprecision? Of course not. In fact, they should refuse to publish the result at all.</p>
<p>Value-added estimates are <a href="http://shankerblog.org/?p=1383">no different</a>. Classes are small, and the estimates for some teachers are based on only one or two years’ worth of data. The performance of just a few outlier students can dramatically affect the estimate for a single teacher. In other words, in many cases, samples are too small to produce estimates that are even remotely reliable (there is also measurement error in the tests themselves). Moreover, even for teachers who have more years of data, the imprecision in their estimates is <a href="http://ies.ed.gov/ncee/pubs/20104004/pdf/20104004.pdf">often large as well</a>.</p>
<p>We can actually illustrate this using <a href="http://www.scribd.com/doc/37648467/The-Use-of-Value-Added-Measures-of-Teacher-Effectiveness-in-Policy-and-Practice">real data</a> from New York City, where the average margin of error in value-added scores in 2007-08 (one of the years that will be released) was plus or minus roughly 30 percentile points. That means, for example, that a teacher scoring at the 40th percentile might actually be anywhere between the 10th and 70th percentile. Granted, this teacher is more likely to be 40th than 70th or 10th, but it&#8217;s all a matter of degree.</p>
<p>The estimate for this teacher does not even allow us to have any real confidence that he or she is above or below the median, and this will probably be the case for the majority of teachers in the city. Some will have smaller error margins than the average, some larger, but that’s precisely why they’re so important. Without this information, the estimates simply cannot be interpreted properly, and can be <em>extremely</em> misleading. Not only should the city’s newspapers report the error margins, they should be featured prominently.</p>
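<p>The arithmetic behind the 40th percentile example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the range is simply the estimate plus or minus the margin, and it is capped at the 0th and 100th percentiles. It is not how the city computes its reports.</p>

```python
def percentile_interval(estimate, margin):
    """Return the (low, high) percentile range implied by an error margin."""
    low = max(0.0, estimate - margin)      # a percentile cannot fall below 0
    high = min(100.0, estimate + margin)   # or rise above 100
    return low, high


def straddles_median(estimate, margin, median=50.0):
    """True when the range includes the median, so above/below is unclear."""
    low, high = percentile_interval(estimate, margin)
    return low <= median <= high


# The example from the text: a 40th percentile score, plus or minus 30 points.
low, high = percentile_interval(40, 30)
print(low, high)
print(straddles_median(40, 30))
```

<p>For the teacher in the example, the range runs from the 10th to the 70th percentile and includes the median, which is why the estimate alone cannot tell us whether he or she is above or below it.</p>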
<p>If newspapers do otherwise, the estimates are certain to be misinterpreted. They would be violating not only (in my opinion) principles of fairness, but the most basic standards for accuracy as well.</p>
<p>- Matt Di Carlo</p>
<p>*****</p>
<p>* The actual information that newspapers will receive is the city’s “teacher data reports” (<a href="http://schools.nyc.gov/NR/rdonlyres/07FDCE5A-14D3-423D-93D0-399F91AFAB6F/103163/SAMPLETEACHERDATAREPORT.pdf">here’s a sample</a> of one), which do report error margins, meaning the papers will have the information. It’s possible that the papers will allow readers to download the reports themselves, but, given how many teachers there are in the city, it’s more likely that they will reproduce the data in a more concise, user-friendly format.</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=5087</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Guessing About NAEP Results</title>
		<link>http://shankerblog.org/?p=5064</link>
		<comments>http://shankerblog.org/?p=5064#comments</comments>
		<pubDate>Wed, 15 Feb 2012 17:07:15 +0000</pubDate>
		<dc:creator>Matthew Di Carlo</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[accountability]]></category>
		<category><![CDATA[education debate]]></category>
		<category><![CDATA[education research]]></category>
		<category><![CDATA[naep]]></category>
		<category><![CDATA[standardized tests]]></category>
		<category><![CDATA[testing data]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=5064</guid>
		<description><![CDATA[There is an abundance of commentary trying to explain recent trends in national test scores. For the most part, these arguments are little more than speculation, but they also tend to presume incorrectly that schools and education policy are the only things that influence testing outcomes.]]></description>
			<content:encoded><![CDATA[<p>Every two years, the release of data from the <a href="http://nces.ed.gov/nationsreportcard/">National Assessment of Educational Progress</a> (NAEP) generates a wave of research and commentary trying to explain short- and long-term trends. For instance, there have been a bunch of recent attempts to “explain” an increase in aggregate NAEP scores during the late 1990s and 2000s. Some <a href="http://edexcellencemedia.net/publications/2011/20111215-The-Accountability-Plateau/20111215-The-Accountability-Plateau.pdf">analyses</a> postulate that the accountability provisions of NCLB were responsible, while more recent <a href="http://www.edexcellence.net/commentary/education-gadfly-daily/ohio-gadfly-daily/2011/lackluster-naep-results-from.html">arguments</a> have focused on the “effect” (or lack thereof) of newer market-based reforms – for example, looking to NAEP data to “prove” or “disprove” the idea that changes in teacher personnel and other policies have (or have not) generated “gains” in student test scores.</p>
<p>The basic idea here is that, for every increase or decrease in cross-sectional NAEP scores over a given period of time (both for all students and <a href="http://www.epi.org/publication/fact-challenged_policy/">especially for subgroups</a> such as minority and low-income students), there must be “something” in our education system that explains it. In many (but not all) cases, these discussions consist of little more than speculation. Discernible trends in NAEP test score data are almost certainly due to a combination of factors, and it’s unlikely that one policy or set of policies is dominant enough to be identified as “the one.” Now, there’s nothing necessarily wrong with speculation, so long as it is clearly identified as such, and conclusions presented accordingly. But I find it curious that some people involved with these speculative arguments seem a bit too willing to assume that schooling factors – rather than changes in cohorts’ circumstances outside of school – are the primary driver of NAEP trends.</p>
<p>So, let me try a little bit of <em>illustrative </em>speculation of my own: I might argue that changes in the economic conditions of American schoolchildren and their families are the most compelling explanation for changes in NAEP.<span id="more-5064"></span></p>
<p>Here’s my story: Early childhood (0-5) is the <a href="http://www.pnas.org/content/103/27/10155.full.pdf">most important time</a> for children as far as cognitive and non-cognitive development. When the economic circumstances &#8211; and all they entail &#8211; among families with young children improve relative to those of previous cohorts, the kids will perform better over the long term. The mid- to late-1990s was a time of remarkable economic growth, and the benefits were <a href="http://www.epi.org/publication/books_swa2000_swa2000intro/">shared by most workers</a> – higher- and lower-earners, whites and minorities. On average, children who were born or very young during this era experienced more economic stability and resources than their “predecessors.” We might therefore expect them to perform better on tests like NAEP.</p>
<p>To provide a ballpark identification of these children, let’s look at babies born around 1995, when U.S. economic growth started to pop. The two main NAEP assessments – reading and math – are administered to fourth and eighth graders every two years. Children born around 1995, who would have experienced improved economic conditions, on average, during early childhood, would have taken their first NAEP test at roughly age eight or nine (fourth grade), which means we should look at the NAEP results for fourth graders taking the test around 2003.</p>
<p>The trend in fourth grade math and reading is presented in the two simple graphs below, which are taken directly from the NAEP reports on <a href="http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2012457">reading</a> and <a href="http://nces.ed.gov/nationsreportcard/pdf/main2011/2012458.pdf">math</a> (note that the two y-axes are a bit different).</p>
<p style="text-align: center;"><img class="size-full wp-image-5063 aligncenter" title="naeptrend" src="http://shankerblog.org/wp-content/uploads/2012/02/naeptrend.png" alt="" width="423" height="457" /></p>
<p>The &#8220;evidence&#8221; is fairly clear – the cohort of fourth graders tested in 2002 (reading) and 2003 (math) exhibited substantial increases relative to those tested in prior years. In reading, the increase was around six points between 2000 and 2002, which is equivalent to roughly half a “year of learning.” The increase in math was no less impressive – a jump of nine points between 2000 and 2003. In both subjects, the 2002-2003 gains were followed by significant, but much smaller, positive changes throughout the 2000s, leveling out after 2007.</p>
<p>These raw data would seem to suggest that the NAEP increases since the start of this century – the subject of endless debate and advocacy, virtually all of which has been focused on NCLB and other education policies –  may, to a substantial degree, be due to the economic circumstances of families, rather than the quality of schooling. This supposition is supported by the <a href="http://shankerblog.org/?p=74">literature</a> demonstrating the influence of non-school factors (e.g., family background, mostly unobserved) on student performance <a href="http://shankerblog.org/?p=4980">as measured by tests</a>.</p>
<p>Now, let me reiterate my point: This “story,” and the descriptive evidence I present to support it, is no less a matter of speculation than anyone else’s. Education and learning are complex and subject to an interconnected web of causal influences, school- and non-school alike. Without question, good education policy can generate test score increases, which hopefully reflect “real” improvement, and it’s very plausible that federal, state, and local education policies are factors in the increases shown above.</p>
<p>That said, when you simply eyeball the data and then make causal arguments, you’re really just guessing (as I did above). That&#8217;s fine, but if you’re guessing without considering the fact that non-school factors are often the primary drivers of aggregate testing outcomes, it’s probably not a very good guess. This is especially true in the case of NAEP and other aggregate, cross-sectional data &#8211; <a href="http://www.mathematica-mpr.com/publications/pdfs/Education/False_Perf.pdf">non-school inputs can alter the characteristics of students taking the tests</a>, which can influence results, both nationally and for individual states/districts. The knee-jerk, “something has to explain it” tendency is to blame or replicate the policies that we are most focused on today, while ignoring the web of complex factors and practices, past and present, that may actually be much more salient to student success (see <a href="http://www.aera.net/uploadedFiles/Publications/Journals/Educational_Researcher/3605/07EDR07_268-278.pdf">here</a> and <a href="http://mirror.nber.org/cgi-bin/sendWP.cgi/1864852000/w10591.pdf">here</a> for more thorough, but still limited, analyses using NAEP, and <a href="http://epaa.asu.edu/ojs/article/view/102/228">this article</a> summarizing recent NAEP research and its issues).</p>
<p>Trying to make sense of an inherently messy situation, such as the causes and implications of testing trends, is a worthwhile endeavor – we have to try to understand these things. But ignoring the diversity of factors that influence learning makes it much, much messier. The best course is to rely on evidence that controls for time-varying school and non-school factors, preferably using longitudinal data, before drawing anything beyond tentative conclusions.</p>
<p>- Matt Di Carlo</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=5064</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A Case For Value-Added In Low-Stakes Contexts</title>
		<link>http://shankerblog.org/?p=5056</link>
		<comments>http://shankerblog.org/?p=5056#comments</comments>
		<pubDate>Mon, 13 Feb 2012 17:35:49 +0000</pubDate>
		<dc:creator>Matthew Di Carlo</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[education reform]]></category>
		<category><![CDATA[standardized tests]]></category>
		<category><![CDATA[teacher effects]]></category>
		<category><![CDATA[value-added]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=5056</guid>
		<description><![CDATA[All the controversy over using student test scores to evaluate and pay teachers may be drowning out the fact that these data, used correctly, have a great deal of potential to improve instruction.]]></description>
			<content:encoded><![CDATA[<p>Most of the controversy surrounding value-added and other test-based models of teacher productivity centers on the high-stakes use of these estimates. This is unfortunate – no matter what you think about these methods in the high-stakes context, they have a great deal of potential to improve instruction.</p>
<p>When supporters of value-added and other growth models talk about low-stakes applications, they tend to assert that the data will inspire and motivate teachers who are completely unaware that they’re not raising test scores. In other words, confronted with the value-added evidence that their performance is subpar (at least as far as tests are an indication), teachers will rethink their approach. I don’t find this very compelling. Value-added data will not help teachers – even those who believe in its utility – unless they know <em>why</em> their students’ performance appears to be comparatively low. It’s rather like telling a baseball player they’re not getting hits, or telling a chef that the food is bad – it’s not constructive.</p>
<p>Granted, a big problem is that value-added models are not actually designed to tell us why teachers get different results – i.e., whether certain instructional practices are associated with better student performance. But the data can be made useful in this context; the key is to present the information to teachers in the right way, and rely on their expertise to use it effectively.<span id="more-5056"></span></p>
<p>For example, one of the most promising approaches for translating less-than-informative teacher effect estimates into actionable information is <em>disaggregation</em> – i.e., presenting the estimates by student subgroup. For instance, if a teacher is told that her English language learners tend to make less rapid progress than her native speakers, this is potentially useful – she might rethink how she approaches those students and what additional supports they may need from the school system. Similarly, if there are strong gains among those students who started out at a lower level (i.e., their score the previous year) and stagnation for those starting out at a higher level, this suggests the need for more effective differentiation.</p>
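<p>A minimal sketch of what disaggregation means in practice follows. The records and subgroup labels are made up purely for illustration, and the calculation is a plain average of score gains within each subgroup, which is far simpler than an actual value-added or growth model.</p>

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (subgroup label, prior-year score, current-year score).
# These numbers are invented for illustration and come from no real dataset.
records = [
    ("english_learner", 200, 204),
    ("english_learner", 210, 216),
    ("native_speaker", 205, 215),
    ("native_speaker", 220, 232),
]


def mean_gain_by_subgroup(rows):
    """Average the year-to-year score gain separately for each subgroup."""
    gains = defaultdict(list)
    for subgroup, prior, current in rows:
        gains[subgroup].append(current - prior)
    return {subgroup: mean(values) for subgroup, values in gains.items()}


print(mean_gain_by_subgroup(records))
```

<p>A single overall average would hide the difference between the two groups in this toy example; splitting the gains by subgroup is what makes the pattern visible to the teacher.</p>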
<p>Needless to say, such information still requires strong professional judgment. In the end, teachers and administrators will have to figure out the specifics of any plan for improvement. In addition, since subgroups (e.g., non-native English speakers) are smaller samples, most teachers would need to have a few years of data in order to discern these patterns. Finally, and most obviously, teachers who do not understand or trust the estimates themselves are unlikely to respond to the data – explaining the methods and results, their strengths and weaknesses, is a necessary first step.</p>
<p>All of this suggests the critical importance of an issue that is not often discussed or researched – how to present value-added data to teachers in the most useful manner. Although a full discussion of this issue is beyond the scope of this post, a few quick suggestions, based on the discussion above, might include: a clear description of the methods and how to interpret the results, along with ongoing outreach for training and a means (e.g., a “hotline”) for teachers to ask questions; presenting <a href="http://shankerblog.org/?p=1383">error margins</a> for each estimate (so teachers know when results are still too imprecise); and disaggregation of estimates by student subgroup.*</p>
<p>The majority of teachers, even those who are strongly skeptical about value-added, have long used testing data productively. Tests have always been used to diagnose student strengths and weaknesses, and skilled teachers have always used these data to help improve instruction. Value-added estimates could add even more useful data to that arsenal of information.</p>
<p>Hopefully, these productive low/no-stakes uses for value-added have not been drowned out by all the controversy over its high-stakes use. Research and policy should start focusing on the former as well.</p>
<p>- Matt Di Carlo</p>
<p>*****</p>
<p>* Many states and districts using growth model estimates produce some form of &#8220;teacher report.&#8221; My (admittedly limited) review of a few of these suggests that they vary quite widely in how they present the data, as well as in their descriptions and guidelines for interpretation. I might explore this more thoroughly in a future post.</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=5056</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>A Big Open Question: Do Value-Added Estimates Match Up With Teachers’ Opinions Of Their Colleagues?</title>
		<link>http://shankerblog.org/?p=5037</link>
		<comments>http://shankerblog.org/?p=5037#comments</comments>
		<pubDate>Thu, 09 Feb 2012 16:10:59 +0000</pubDate>
		<dc:creator>Matthew Di Carlo</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[education research]]></category>
		<category><![CDATA[student surveys]]></category>
		<category><![CDATA[survey data]]></category>
		<category><![CDATA[teacher evaluation]]></category>
		<category><![CDATA[value-added]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=5037</guid>
		<description><![CDATA[The validity of test-based measures of teachers' effectiveness has been assessed by comparing these estimates with the results of principal observations and even student surveys, but not with one seemingly important outcome: Teachers' opinions of their colleagues.]]></description>
			<content:encoded><![CDATA[<p>A recent <a href="http://www.huffingtonpost.com/2012/02/07/tennessee-teacher-evaluat_n_1260790.html?page=2">article</a> about the implementation of new teacher evaluations in Tennessee details some of the complicated issues with which state officials, teachers and administrators are dealing in adapting to the new system. One of these issues is somewhat technical – whether the various components of evaluations, most notably principal observations and test-based productivity measures (e.g., value-added) – tend to “match up.” That is, whether teachers who score high on one measure tend to do similarly well on the other (see <a href="http://www.brookings.edu/~/media/Files/rc/reports/2011/0426_evaluating_teachers/0426_evaluating_teachers.pdf">here</a> for more on this issue).</p>
<p>In discussing this type of validation exercise, the article notes:</p>
<blockquote><p>If they don&#8217;t match up, the system&#8217;s usefulness and reliability could come into question, and it could lose credibility among educators.</p></blockquote>
<p>Value-added and other test-based measures of teacher productivity may have a credibility problem among many (but definitely not all) teachers, but I don’t think it’s due to – or can be helped much by – whether or not these estimates match up with observations or other measures being incorporated into states’ new systems. I’m all for this type of research (see <a href="http://economics.byu.edu/Documents/Lars%20Lefgren/papers/principals.pdf">here</a> and <a href="http://myweb.fsu.edu/tsass/Papers/IES%20Harris%20Sass%20Principal%20Eval%2034.pdf">here</a>), but I’ve never seen what I think would be an extremely useful study for addressing the credibility issue among teachers: One that looked at the relationship between value-added estimates and <em>teachers’ opinions of each other</em>.<span id="more-5037"></span></p>
<p>If a teacher has a high opinion of one of her colleagues’ effectiveness in the classroom, then it’s unlikely that a negative assessment from any external source – whether value-added models, the principal or even a fellow teacher who doesn’t work at that school – will override that judgment. This is perfectly normal – people tend to trust their own judgment above all else, especially when it’s professionals assessing on-the-job performance.*</p>
<p>If, on the other hand, a teacher found that all the colleagues he or she respected as educators received strong value-added scores, this is the kind of validation that <em>might</em> cause even the most ardent skeptic to rethink their position.</p>
<p>That’s one reason why a systematic analysis of the relationship between teacher value-added estimates and teachers’ assessments of their colleagues could be so powerful. Obviously, such an examination would not allow individual teachers to view their own assessments by their colleagues, or those of other teachers. That would not only completely taint the results (many teachers would not be candid if they knew the individual-level results would be shared), but it’s also unethical, and would likely cause serious problems within a school. The data would have to be completely private and the results reported overall, not school-by-school, and certainly not individually. For the same general reasons, I don&#8217;t think this type of measure could be incorporated into actual evaluations.</p>
<p>If, however, several of these analyses, or a big one that was conducted in a diverse set of schools and districts, showed that, on the whole, teachers who are highly regarded by their colleagues also tend to receive high value-added scores, this might not only boost the credibility of value-added estimates among some teachers, but, for the rest of us, it would represent a fairly powerful partial validation of these estimates’ ability to gauge teacher effectiveness (though, as always, the analysis would probably only include math and reading teachers).</p>
<p>And, of course, the converse is true: If a group of studies found only a weak relationship, this might erode the estimates’ credibility among some people and compel less favorable policy conclusions (of course, the association would be a matter of degree, and different people could interpret it differently, especially if it turned out to be moderate).</p>
<p>It’s a little strange that such a study has not, to my knowledge, been conducted. After all, if the correlation between value-added and <em>students’</em> opinions has policy relevance (see <a href="http://www.metproject.org/downloads/Preliminary_Findings-Research_Paper.pdf">here</a>), then so does the estimates’ relationship with teachers’ opinions.</p>
<p>One reason why this type of analysis hasn&#8217;t been conducted might be the fact that there are significant hurdles. For example, among other problems, teachers vary in their familiarity with each others’ abilities, and they also maintain personal relationships that can color judgments. In addition, teachers may hesitate to bash their colleagues, even if they’re assured that their responses are completely confidential. But these issues are not uncommon in survey research, and I believe there are means of dealing with them.</p>
<p>Still, any rigorous project of this sort would require full cooperation from everyone involved. It would have to be carefully designed, and would probably require some financial investment. But I think it would be well worth it.</p>
<p>- Matt Di Carlo</p>
<p>*****</p>
<p>* The relationship between value-added and <em>peer</em> observations is clearly a similar approach, one that could be done immediately, anywhere such a system exists. This is a good idea, but it&#8217;s not the same thing as what I&#8217;m proposing. Peer observation is a one-shot deal, and is typically (and correctly) carried out by an observer who does not have a day-to-day working relationship with the observed teacher.</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=5037</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>A Look Inside Principals’ Decisions To Dismiss Teachers</title>
		<link>http://shankerblog.org/?p=5029</link>
		<comments>http://shankerblog.org/?p=5029#comments</comments>
		<pubDate>Wed, 08 Feb 2012 15:33:01 +0000</pubDate>
		<dc:creator>Matthew Di Carlo</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[Paper Summary]]></category>
		<category><![CDATA[education reform]]></category>
		<category><![CDATA[education research]]></category>
		<category><![CDATA[teacher quality]]></category>
		<category><![CDATA[tenure]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=5029</guid>
		<description><![CDATA[A recent peer-reviewed study finds that principals seem to consider several factors, including productivity, when making the decision to dismiss a teacher. But the results also suggest a more complicated situation than is portrayed by the common narrative about firing teachers.]]></description>
			<content:encoded><![CDATA[<p>Despite all the heated talk about how to identify and dismiss low-performing teachers, there’s relatively little research on how administrators choose whom to dismiss, whether various dismissal options might actually serve to improve performance, and other aspects in this area. A <a href="http://epa.sagepub.com/content/33/4/403.abstract">paper</a> by economist Brian Jacob, released as a <a href="http://www.nber.org/papers/w15715">working paper</a> in 2010 and published late last year in the journal <em>Educational Evaluation and Policy Analysis</em>, helps address at least one of these voids, by providing one of the few recent glimpses into administrators’ actual dismissal decisions.</p>
<p>Jacob exploits a change in Chicago Public Schools (CPS) personnel policy that took effect for the 2004-05 school year, one which strengthened principals’ ability to dismiss probationary teachers, allowing non-renewal for any reason, with minimal documentation. He was able to link these personnel records to student test scores, teacher and school characteristics and other variables, in order to examine the characteristics that principals might be considering, directly or indirectly, in deciding who would and would not be dismissed.</p>
<p>Jacob’s findings are intriguing, suggesting a more complicated situation than is sometimes acknowledged in the ongoing debate over teacher dismissal policy.<span id="more-5029"></span></p>
<p>Most importantly (but not surprisingly), he finds that there is an association between several proxies for teacher productivity and whether or not teachers were let go. For example, teachers who received lower evaluation ratings in the prior year were far more likely to be non-renewed, as were teachers with a lot more absences. Interestingly, teachers with master’s degrees and more experience were also significantly less likely to be dismissed, all else being equal.</p>
<p>In addition, while he was only able to calculate value-added scores for a subset of the full sample (i.e., teachers in tested grades and subjects), Jacob does find a relationship between these estimates and teacher dismissals. Specifically, a one standard deviation increase in teacher value-added (for example, comparing a teacher at the 50th percentile with one at the 84th) was associated with a decrease in the likelihood of dismissal of around 7 percentage points (a large change relative to the probability for all these teachers). It bears noting, however, that Jacob finds no effect at all among high school teachers (ninth grade teachers only).</p>
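<p>The correspondence between one standard deviation and the move from the 50th to the 84th percentile can be checked with a short Python sketch. It assumes that value-added scores follow a normal distribution, which is an assumption made here for illustration and not something taken from the paper; the function name is hypothetical.</p>

```python
from statistics import NormalDist


def percentile_after_shift(start_percentile, sd_shift):
    """Percentile reached after moving sd_shift standard deviations,
    assuming scores are normally distributed (an illustrative assumption)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = std_normal.inv_cdf(start_percentile / 100)
    return 100 * std_normal.cdf(z + sd_shift)


# One standard deviation above the median teacher.
print(round(percentile_after_shift(50, 1.0), 1))
```

<p>Under that assumption, a teacher one standard deviation above the median lands at roughly the 84th percentile, which matches the comparison used above.</p>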
<p>What these main findings suggest is less than shocking, but important nevertheless – principals appear to consider, directly or indirectly, teacher productivity (and other characteristics) when making decisions to let teachers go.* This also squares with <a href="http://economics.byu.edu/Documents/Lars%20Lefgren/papers/principals.pdf">previous research</a> showing that there is a discernible relationship between teachers’ value-added scores and the ratings they receive from principal observations.</p>
<p>There were a few surprises, however, among some of the purely descriptive results (i.e., raw tabulations). Two in particular stand out.</p>
<p>First, Jacob found that, despite the new policy allowing principals to dismiss probationary teachers at will, a rather high proportion of them didn’t do so. During each year between 2004-05 and 2006-07, principals in around 30-40 percent of Chicago schools chose not to dismiss a single probationary teacher. Further, this phenomenon was not at all limited to “high-performing” and/or low-poverty schools, where one might expect to find a stable, well-trained teaching force. For instance, in 2005, 35 percent of the “lowest performing” schools (the bottom 25 percent) chose not to dismiss any probationary teachers, as compared with 54 percent of the schools with the highest absolute achievement levels (the proportions were similar when school performance was measured in terms of value-added).</p>
<p>In other words, when principals were given free rein to fire for any reason, with virtually no documentation or effort, a significant proportion chose not to use this power even once.</p>
<p>Obviously, we wouldn’t expect every single school to dismiss teachers, and there are several reasons why so many principals might have decided not to. For example, the new policy may have influenced some teachers to resign rather than be fired, and there was indeed an increase in voluntary separations after the new policy went into effect (which would mean the dismissal rates are understated). Principals may also have been less likely to dismiss probationary teachers that they themselves hired.</p>
<p>Nevertheless, on average, roughly one in three CPS teachers is on probation, and the fact that so many principals didn’t dismiss even one of them might suggest, as Jacob points out, that the rules governing the dismissal of probationary teachers were already more flexible than many seem to believe. In other words, when principals don’t let go of many teachers, it may be because they don’t want to, not because they can’t.</p>
<p>A second, perhaps more surprising, finding was that, among the teachers who were dismissed in any given year, around half of them were <em>rehired by a different school in the district</em>. Again, there are a bunch of possible explanations here.</p>
<p>For example, Jacob noted that some of the dismissals were due to position cuts, which might account for some of the rehiring. Also, there are certainly cases in which good teachers just don’t “fit in” at a school for whatever reason. In these cases, it’s quite possible that the shuffling of teachers <a href="http://www.ssc.wisc.edu/~scholz/Seminar/Jackson.pdf">could yield benefits for all</a> (though the rehired teachers were also substantially more likely than other first-year teachers to be let go again).</p>
<p>In considering these results, Jacob reported that there were, on average, ten applicants for every open position in CPS (and, almost certainly, the recession has caused this number to rise since then). Clearly, there were other people applying for the jobs, but principals hired previously-dismissed teachers at what seems like a strikingly high clip.</p>
<p>One other particularly plausible explanation is the labor supply – i.e., that there <a href="http://shankerblog.org/?p=1131">sometimes isn’t a pool of suitable replacements</a> waiting to fill vacancies. That is, principals rehired previously-dismissed teachers because they were the best candidates.</p>
<p>As always, we must be careful about drawing strong conclusions from one analysis. This paper is a rare look at teacher dismissals, but it is necessarily limited. Most notably, it only pertains to probationary teachers. In addition, the data are from only one large district during one three-year time period.</p>
<p>That said, Jacob’s results portray a complicated situation. On the one hand, principals do appear to make dismissal decisions that take into account teacher productivity. This is the paper’s main finding. While it’s important to bear in mind that this analysis is not designed to isolate the effect of the new policy on outcomes, such as student achievement, it does suggest that administrators’ role in dismissal decisions may serve to improve teacher quality over the long run.</p>
<p>On the other hand, there is little support for the idea that principals are just dying to fire at will – or that, once dismissed, teachers can easily be replaced by “better” alternatives – despite sometimes being taken for granted in our education debates. Although they are far from conclusive, and pertain only to probationary teachers, the descriptive results discussed above tentatively suggest that the supply of appropriate replacements may not always be quite as robust as is often assumed – and/or that there may be some other reasons for low dismissal rates that are not entirely a function of the difficulty of doing so.</p>
<p>In short, we should be careful not to reduce the complexity of employment policies and labor markets to a simple narrative in which personnel policies are the only impediment to improvements in teaching quality.</p>
<p>- Matt Di Carlo</p>
<p>*****</p>
<p>* The results also indicate that, even controlling for all the other variables, principals were significantly more likely to dismiss teachers with certain characteristics, such as men and older teachers. But it’s important to note that this does not necessarily represent evidence of discrimination. There are any number of unobserved characteristics, such as motivation or the ability to relate to children, that may be correlated with demographic characteristics such as gender and age. If that’s the case, then the models would “mistake” these factors for demographics. In other words, these methods are not designed to “detect” discrimination. Nevertheless, the differences by age and gender in the likelihood of dismissal are certainly cause for concern and further research, and Jacob dutifully acknowledges as much.</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=5029</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Fundamental Flaws In The IFF Report On D.C. Schools</title>
		<link>http://shankerblog.org/?p=4994</link>
		<comments>http://shankerblog.org/?p=4994#comments</comments>
		<pubDate>Mon, 06 Feb 2012 14:09:46 +0000</pubDate>
		<dc:creator>Matthew Di Carlo</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[Paper Summary]]></category>
		<category><![CDATA[accountability]]></category>
		<category><![CDATA[dcps]]></category>
		<category><![CDATA[district of columbia]]></category>
		<category><![CDATA[education reform]]></category>
		<category><![CDATA[education research]]></category>
		<category><![CDATA[school effects]]></category>
		<category><![CDATA[testing data]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=4994</guid>
		<description><![CDATA[A new report, commissioned by the District of Columbia, makes school improvement recommendations, including the closure of a few dozen schools, based on a deeply flawed analysis of completely inappropriate data.]]></description>
			<content:encoded><![CDATA[<p>A new <a href="http://www.washingtonpost.com/r/2010-2019/WashingtonPost/2012/01/26/Education/Graphics/IFF_Final_Report.pdf">report</a>, commissioned by the District of Columbia Mayor Vincent Gray and conducted by the Chicago-based consulting organization <a href="http://iff.org">IFF</a>, was supposed to provide guidance on how the District might act and invest strategically in school improvement, including optimizing the distribution of students across schools, many of which are either over- or under-enrolled.</p>
<p>Needless to say, this is a monumental task. Not only does it entail the identification of high- and low-performing schools, but plans for improving them as well. Even the most rigorous efforts to achieve these goals, especially in a large city like D.C., would be to some degree speculative and error-prone.</p>
<p>This is not a rigorous effort. IFF’s final report is polished and attractive, with lovely maps and color-coded tables presenting a lot of summary statistics. But there’s no emperor underneath those clothes. The report&#8217;s data and analysis are so deeply flawed that its (rather non-specific) recommendations should not be taken seriously.<span id="more-4994"></span></p>
<p>The authors&#8217; general approach, slightly simplified, is as follows: All schools (DCPS and charter) are ranked, based on four measures: their school- and grade-level current (2011) proficiency rates in math and reading; and their “projected” 2016 proficiency rates in both subjects. The latter is derived from a simple OLS regression that uses each school’s prior rate increases (between 2007 and 2011) to predict where it will be five years down the road.</p>
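<p>To make the projection step concrete, here is a minimal sketch of a straight-line trend fit and extrapolation. The yearly rates are hypothetical, and the report's exact regression specification is an assumption on my part; this only illustrates the general idea of fitting 2007-2011 rates and reading off a 2016 value.</p>

```python
def linear_projection(years, rates, target_year):
    """Fit rate = intercept + slope * year by ordinary least squares,
    then extrapolate the fitted line to target_year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * target_year


# Hypothetical school: proficiency rate (percent) rising 4 points per year.
years = [2007, 2008, 2009, 2010, 2011]
rates = [40.0, 44.0, 48.0, 52.0, 56.0]

projected_2016 = linear_projection(years, rates, 2016)
print(projected_2016)  # 2011 rate plus five more years of the fitted trend
```

<p>Note that nothing in a straight-line fit like this keeps a projected rate below 100 percent, and the result depends heavily on where the school started.</p>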
<p>Based on these ranks, schools are sorted into quartiles, with the top-ranked (&#8220;high-performing&#8221;) schools classified as “Tier 1.” The core results compare the number of seats in these “Tier 1” schools with the number of children within 39 neighborhood “clusters” (geographic subdivisions used by the District for other purposes). Reducing these discrepancies is the focus of the report&#8217;s policy conclusions.</p>
<p>IFF recommends that the District increase the number of students filling “high-performing seats” using a combination of (largely unspecified) investments, closures, moving students into schools under capacity and other strategies.</p>
<p>My first question in reading this report was: Why didn’t IFF use longitudinal student-level data and carry out a proper analysis? By relying exclusively on cross-sectional school- and grade-level proficiency rates, IFF essentially guarantees that their results will be little more than tentative, at best (also see this excellent <a href="http://greatergreaterwashington.org/post/13512/flawed-study-mis-rates-potential-dc-school-closings/">discussion</a> by economist Steve Glazerman, who makes some of these same points).</p>
<p>Proficiency rates can be useful, in that they may be easier to understand than average test scores. Nevertheless, they are not well-suited for serious analyses of school performance. They only tell you how many students scored above a certain cutoff point, and this hides a lot of the variation in actual performance. For example, two students – one scoring just above passing and the other scoring a year or two above grade level – would both be coded the same way, as “proficient (or above).” This is a large sacrifice of data.</p>
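<p>A small sketch of that loss of information, using made-up scale scores and a hypothetical cut score of 300 (neither comes from D.C. data): two schools can post identical proficiency rates while their students' actual scores differ a great deal.</p>

```python
CUTOFF = 300  # hypothetical proficiency cut score


def proficiency_rate(scores, cutoff=CUTOFF):
    """Percent of students scoring at or above the cutoff."""
    proficient = sum(1 for s in scores if s >= cutoff)
    return 100.0 * proficient / len(scores)


def mean_score(scores):
    """Average scale score, which keeps the variation the rate discards."""
    return sum(scores) / len(scores)


# Made-up scores: School A's proficient students barely clear the bar,
# School B's proficient students are far above it.
school_a = [301, 302, 305, 250]
school_b = [380, 390, 400, 299]

rate_a, rate_b = proficiency_rate(school_a), proficiency_rate(school_b)
mean_a, mean_b = mean_score(school_a), mean_score(school_b)

print(rate_a, rate_b)  # identical rates
print(mean_a, mean_b)  # very different average scores
```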
<p>In addition, while only a small part of IFF’s school rankings consist of “growth” measures (a portion of the variation in the “projected” 2016 rates, discussed below), the data they use are cross-sectional, and are therefore not appropriate. They don’t follow students over time, which means that the changes are not “growth” at all, and may to a large degree <a href="http://shankerblog.org/?p=592">reflect demographic and other shifts</a> in the student population of each school. This problem is especially severe at the grade-level (because samples are smaller), particularly in D.C., where <a href="http://shankerblog.org/?p=3637">student mobility is exceedingly high</a> (in no small part due to rapid charter proliferation). In other words, proficiency rates may change simply because somewhat different sets of students are being tested every year, not because there has been any real progress.</p>
<p>In short, IFF’s reliance on cross-sectional proficiency rates calls its entire analysis into question. These rates are terrible measures of performance, both in any given year and over time, and one can only wonder why a District consultant wouldn’t employ better data. Even a more rigorous analysis would have been suspect using these data.</p>
<p>But, in addition to the issue of school- and grade-level proficiency rates, there are serious problems with IFF’s primary analytical approach – i.e., the manner in which they sort schools into performance “tiers.” As stated above, each school’s performance is assessed by an average of four measures: 2011 proficiency rates in math and reading; and projected 2016 proficiency rates in both subjects, which are calculated based on each school’s (assumed to be linear) trajectory between 2007 and 2011 (both over the whole school and for grade clusters [K-5, 6-8, 9-12]).</p>
<p>There are many small problems with this scheme (e.g., the loss of data in rankings and sorting into “tiers”), but they’re all somewhat superfluous, given the fact that the method IFF uses does not measure the actual effectiveness of each school.*</p>
<p>Even using the best data (which IFF does not), testing results, whether proficiency rates or scores, are in large part <a href="http://shankerblog.org/?p=4980">due to either random variation or factors outside of schools’ control</a> (e.g., students’ backgrounds). Half of IFF’s school performance measure consists of 2011 proficiency rates in math and English, which are essentially just measures of student background. Unless you control for these factors, these rates can tell you almost nothing about the actual quality of instruction going on in the school. Schools with high rates <a href="http://shankerblog.org/?p=4903">are not necessarily high-performing schools</a>, and vice-versa.</p>
<p>The other half of IFF’s school rankings – the 2016 “projected” rates – aren’t much better, because the projected gains are simply added to the 2011 rates, and the results are therefore mostly just absolute performance measures (i.e., severely biased by student characteristics).</p>
<p>For example, let’s say we have two schools: One serves mostly higher-income students and has a 2011 proficiency rate of 80 percent; the other serves mostly lower-income students, and has a 2011 rate of 40 percent. Now, let’s say IFF projects that both will pick up 20 percentage points by 2016. Putting aside all the flaws in IFF’s methods (including the fact that their data are cross-sectional and therefore do not measure &#8220;growth&#8221; <em>per se</em>), this would imply that both schools are equally effective – students in both schools will make similar “progress” over the next five years.</p>
<p>But IFF’s methods don’t posit that 20 point increase as their measure. Instead, they add the projected “growth” (20 points) to the 2011 rates, which means that our higher-income school will have a projected 2016 rate of 100 percent, while our lower-income school will come in at 60 percent. The first school will appear to be far better, even though there was actually no difference in the schools’ effectiveness in boosting test scores (which is generally considered the appropriate gauge of school effectiveness).</p>
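<p>The arithmetic of this two-school example can be laid out in a few lines. The school labels are just placeholders for the hypothetical schools described above; the numbers are the ones from the example.</p>

```python
# The two hypothetical schools from the example: same projected gain,
# very different starting points.
schools = {
    "higher_income": {"rate_2011": 80, "projected_gain": 20},
    "lower_income": {"rate_2011": 40, "projected_gain": 20},
}

# Projected 2016 rate = 2011 rate + projected gain.
projected_2016 = {
    name: s["rate_2011"] + s["projected_gain"] for name, s in schools.items()
}

# Gap between the schools on each measure.
gap_in_gain = (
    schools["higher_income"]["projected_gain"]
    - schools["lower_income"]["projected_gain"]
)
gap_in_projected_rate = (
    projected_2016["higher_income"] - projected_2016["lower_income"]
)

print(projected_2016)         # 100 versus 60
print(gap_in_gain)            # 0: identical projected "progress"
print(gap_in_projected_rate)  # 40: the entire 2011 gap carries over
```

<p>The 40-point gap in the projected rates is exactly the gap in 2011 rates; the “growth” component contributes nothing to the difference between the two schools.</p>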
<p>As a result, even if their projection methods were appropriate (and they&#8217;re not), the manner in which IFF uses its “growth” measure – adding it to the 2011 rates – essentially ensures that the vast majority of the variation between schools in their final ratings will be due to factors outside of schools’ control.**</p>
<p>This analysis is therefore recommending the closure and expansion of schools, among other actions, based on a criterion that has little to do with how they actually perform, and is largely a function of the backgrounds of the students that attend them (e.g., income, whether or not they are native English speakers, etc.).</p>
<p>Overall, IFF had a very difficult task – identifying low- and high-performing schools – and, for whatever reason, they do not appear to have been up to the challenge. Their data are inappropriate and their methods too simplistic and flawed to accomplish the goals they set out to accomplish.</p>
<p>Even if IFF’s policy recommendations were sound – and they largely boil down to closing or improving low-performing schools, opening more high-performing schools, and monitoring schools in the middle – their methods for classifying these schools are not credible. As Bruce Baker <a href="http://schoolfinance101.wordpress.com/2012/02/03/newark-public-schools-lets-just-close-the-poor-schools-and-replace-them-with-less-poor-ones/">puts it</a>, IFF is essentially saying that we should “close poor schools and replace them with less poor ones” (also check out Glazerman’s above-mentioned article for more on the recommendations).</p>
<p>This report, though attractive and full of interesting summary data, provides little of value in terms of informing sound policy decisions.</p>
<p>- Matt Di Carlo</p>
<p>*****</p>
<p>* There really is a striking progression of data loss in this analysis. Most generally, as discussed above, IFF uses proficiency rates (how many students above or below the line), which ignores underlying variation in actual scores. In addition, they’re using cross-sectional grade- and school-level data, which masks differences between students in any given year, and over time. Then they use the rates (and projected rates) to calculate rankings, which ignore the extent of differences between schools. And, finally, the rankings are averaged and schools are sorted in quartiles (performance “tiers”), losing even more data – for example, schools at the “top” of “Tier 2” may have essentially the same scores as schools at the “bottom” of “Tier 1.” At each &#8220;step,&#8221; a significant chunk of the variation between schools in their students’ testing performance is forfeited.</p>
<p>** I cannot illustrate this bias directly – i.e., using real data from the report and the District’s schools – without a somewhat onerous manual data entry effort (neither D.C. nor IFF provides its data in a format that is convenient for analysts). But it’s really not necessary – it’s beyond dispute that absolute proficiency rates are largely a function of student characteristics, and IFF’s school rankings are based predominantly on those rates. See this <a href="http://shankerblog.org/?p=4941">discussion</a>, as well as <a href="http://shankerblog.org/?p=4903">this example</a> from Florida and <a href="http://shankerblog.org/?p=3652">this one</a> from Ohio.</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=4994</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Perilous Conflation Of Student And School Performance</title>
		<link>http://shankerblog.org/?p=4980</link>
		<comments>http://shankerblog.org/?p=4980#comments</comments>
		<pubDate>Thu, 02 Feb 2012 16:12:24 +0000</pubDate>
		<dc:creator>Matthew Di Carlo</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[accountability]]></category>
		<category><![CDATA[education research]]></category>
		<category><![CDATA[school effects]]></category>
		<category><![CDATA[standardized tests]]></category>
		<category><![CDATA[testing data]]></category>

		<guid isPermaLink="false">http://shankerblog.org/?p=4980</guid>
		<description><![CDATA[Far too much of our public discourse and, disturbingly, policymaking seems to ignore the fact that raw, unadjusted standardized test results are primarily measures of student performance, not school performance.]]></description>
			<content:encoded><![CDATA[<p>Unlike many of my colleagues and friends, I personally support the use of standardized testing results in education policy, even, with caution and in a limited role, in high-stakes decisions. That said, I also think that the focus on test scores has <a href="http://shankerblog.org/?p=3165">gone way too far</a> and their use is being <a href="http://shankerblog.org/?p=1383">implemented unwisely</a>, in many cases to a degree at which I believe the policies will not only fail to generate improvement, but may even risk harm.</p>
<p>In addition, of course, tests have a very productive low-stakes role to play on the ground – for example, when teachers and administrators use the results for diagnosis and to inform instruction.</p>
<p>Frankly, I would be a lot more comfortable with the role of testing data – whether in policy, on the ground, or in our public discourse – but for the relentless flow of misinterpretation from both supporters and opponents. In my experience (which I acknowledge may not be representative of reality), by far the most common mistake is the conflation of student and school performance, as measured by testing results.</p>
<p>Consider the following three stylized arguments, which you can hear in some form almost every week:<span id="more-4980"></span></p>
<ol>
<li>Only one-third of our students are reading at grade level; our schools are failing;</li>
<li>95 percent of the teachers in this district receive satisfactory ratings, but that can’t be accurate, because only half the students are proficient in math and reading;</li>
<li>These reforms are working – state test scores have risen steadily.</li>
</ol>
<p>All three of these inferences are inappropriate for one primary reason: they fail to acknowledge that raw, unadjusted testing results – whether actual scores/proficiency rates or changes in those scores/rates – are not, by themselves, credible measures of school performance. They are largely (imperfect) measures of <em>student</em> performance. There is a difference.</p>
<p>Everyone involved in education knows that <a href="http://shankerblog.org/?p=74">most of the variation</a> in testing outcomes is “between students” – i.e., has to do with factors, most unmeasured/unobserved, that are attributes of the students themselves and their upbringing and environment (such as English proficiency, oral language development, background knowledge, family situation, etc.).</p>
<p>This well-established finding is sometimes interpreted to mean that schools (or teachers) can only exert minimal influence on student performance. That is false. Not only are schooling factors among the only targets within the purview of education policy, they can also be very influential. Improvements in the quality of schooling/instruction can have substantial effects on student outcomes (though I sometimes think we need to be <a href="http://shankerblog.org/?p=4496">more realistic</a> about the pace of change).</p>
<p>Nevertheless, learning is complex and much (if not most) of it occurs outside of schools and/or before children reach schooling age. Test scores – and changes in those scores – are subject to these influences. A school with low test scores is not necessarily a “failing school,” just as a school with very high scores is not necessarily successful.</p>
<p>Similarly, one should not assume that a school’s slow score growth is necessarily caused by a problem in that school. The reason why the research on school (and teacher) effects is so complex is that <a href="http://www.mathematica-mpr.com/publications/pdfs/Education/False_Perf.pdf">much of it is geared toward controlling for all of the external factors</a> that can be measured and are known to affect outcomes. In other words, the analysis is trying to isolate that portion of <em>student</em> performance that can reasonably be attributed to <em>school </em>performance. A great deal of the raw variation is also simple <a href="http://www.wcer.wisc.edu/news/events/VAM%20Conference%20Final%20Papers/MeasuringEffectSizes_BoydEtAl.pdf">random error</a>.</p>
<p>Yes, when a group of students&#8217; test scores rise over a few years, that’s a pretty good tentative indication that the school is doing something correctly. But it’s all a matter of degree. The gains (assuming they’re even measured with longitudinal data, which <a href="http://shankerblog.org/?p=592">they often are not</a>) will also reflect factors (e.g., prior achievement levels) that have nothing to do with the school, to an extent that can vary widely. If you rely solely on unadjusted testing results, you don’t know. And if you don’t know, you risk making decisions based on erroneous assumptions.</p>
<p>The worst part is that this distinction – between tests as measures of student performance versus school performance – is ignored by policymakers just as frequently as it is in our public discourse.</p>
<p>States are <a href="http://www.nytimes.com/2011/12/09/nyregion/12-new-york-schools-with-low-test-scores-are-put-on-closing-list.html">closing schools</a>, handing out <a href="http://shankerblog.org/?p=4903">ratings</a> and <a href="http://shankerblog.org/?p=4818">awarding grant money</a> based on horribly flawed misinterpretations of raw testing data. It’s one thing for journalists and the public to make this mistake; it’s something else entirely for the people we rely on to decide education policy to make it too.</p>
<p>In short, I would be a lot more optimistic about “data-driven decision making” if so many of the decision makers weren’t such erratic drivers.</p>
<p>- Matt Di Carlo</p>
]]></content:encoded>
			<wfw:commentRss>http://shankerblog.org/?feed=rss2&amp;p=4980</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

