Judging health systems: reflections on WHO's methods
Philip Musgrove PhD
The attainment values in WHO's World Health Report 20001 are spurious: only 39% are country-level observations. The responsiveness indicators are not comparable across countries; and three values obtained from expert informants were discarded in favour of imputed values. Indices of composite attainment and performance are based on imputations and thus are also meaningless. Member governments were not informed of the methods and sometimes suffered unjust criticism because of the rankings. Judgments about performance should be based on real data, represent methodological consensus, be built from less aggregated levels, and be useful for policy.
By way of explanation
Readers of the World Health Report 20001 will have noticed that although many numbers are in the Annex tables, there is scant reference in the text to the indicators: “Fortunately…the report appears to make very little connection between the results of the performance analysis and the implications for undertaking [the] functions [of health systems]”.2 Text authors were told essentially nothing about how some of the indicators were estimated until near the end of the report's production. Two exceptions should be noted. I participated in the “fair financing” indicator and coauthored two papers describing its construction.3, 4 Chapter 2 of the World Health Report 2000, which I wrote, was the last written, and reflects what I learned from the team working on the Annex in the last weeks. References to “WHO decision-makers” mean Christopher Murray and Julio Frenk—respectively, Director, Global Program on Evidence (GPE); and Executive Director, Evidence and Information for Policy (EIP). References to “WHO staff” include people who worked on the report but who did not make decisions.
Numbers in the Annex
The main feature of the attainment numbers is that most were not derived from any detailed national-level information. WHO decision-makers chose to run linear regressions on real data, and impute values for countries for which there were no data. The imputed values are indicated in the report by italicised numbers, but the footnote which says “Figures in italics are based on estimates” does not explain anything about the estimates. Even for five countries for which all indicators were measured directly, the performance index is the result of imputations. Because it was evident by the time Chapter 2 was written that a large share of the so-called data were going to be imputed, the only comparisons across countries that include the imputed values are figures 2.6 and 2.7 and the accompanying discussion. I now think it was a mistake to have put these figures in the text, which otherwise omitted such presentations.
Ministers of health of the world may have felt, the day the report was published, like parents whose children had been given grades in courses in which they did not know they were enrolled. WHO representatives and liaison officers were also taken by surprise; the advance copies of the report and press materials they were given did not enable them to explain to outraged or baffled officials where the numbers came from. Only 39% of the indicator values represent real data, which falls to 18.5% if disability-adjusted life expectancy is set aside. This indicator was the only one not imputed by regression for 118 of the 191 member states of WHO. Among these 118 states are 25 in the top 30 positions by attainment and 23 in the top 30 positions by performance. The panel and the table summarise the amount of imputation by country and indicator.
Shares of detailed national data in the overall attainment index
The number to the right of each country's name in the large table indicates whether detailed data were used for one, two, or three of the indicators. Somalia is given a zero, and Botswana a value of 1, because their estimates of child survival were not calculated from detailed national data. Health inequality (child survival), responsiveness (level and inequality together), and fair financing each account for 25% of the composite index of attainment. Therefore, the share of that index for a particular country that is based on detailed national-level information is given by 0.25(N+1), where N is the number above, and 1 is added to account for the share of attainment due to health level measured by disability-adjusted life expectancy (DALE).
The number of countries for each value of N, and the corresponding share of the information derived from detailed national data, are as follows:
Number of indicators for which detailed data were used (N)   Number of countries   Share of attainment index from detailed data
0   118   25%
1   45   50%
2   23   75%
3   5   100%
For the 191 member states of WHO, the mean value of N is 0.555 = [(118×0)+(45×1)+(23×2)+(5×3)]/191. The corresponding mean share of the composite index derived from detailed national data is 0.389. That is, 61% of the numbers which go into the index are derived from imputations and only 39% are based on detailed analyses without any projection across countries. There are 118 countries for which N=0, because the only detailed national data refer to health level—that is 62% of all the countries, which is coincidentally almost equal to the 61% data share. The mean value of N for indicators other than health level is only 0.234, less than half the overall mean of N. This corresponds to a share of real numbers for those indicators (health inequality, responsiveness, and financial fairness) of only 18.5%, equivalent to having detailed and complete information for 35 countries.
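The arithmetic above can be sketched in a few lines, assuming the counts of countries per value of N given in the table (118, 45, 23, and 5) and the 25% weight each component carries in the attainment index:

```python
# Number of WHO member states grouped by N, the count of the three non-DALE
# indicators (health inequality, responsiveness, fair financing) that were
# measured from detailed national data (counts as reported in the text).
counts = {0: 118, 1: 45, 2: 23, 3: 5}  # N -> number of countries

total = sum(counts.values())                               # 191 member states
weighted = sum(n * c for n, c in counts.items())           # country-indicator pairs with real data
mean_n = weighted / total

# Each of the four components carries a 25% weight; DALE (health level) is
# always based on detailed data, so a country's data-based share of the
# composite attainment index is 0.25 * (N + 1).
mean_share = 0.25 * (mean_n + 1)

# Share of real data among the three non-DALE indicators alone.
non_dale_share = weighted / (3 * total)

print(round(mean_n, 3))          # ~0.555
print(round(mean_share, 3))      # ~0.389
print(round(non_dale_share, 3))  # ~0.185, i.e. 18.5%
```

The last figure, 18.5% of 191 countries, is what corresponds to having complete information for about 35 countries.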
Countries for which detailed national data were used to calculate components of the indexes of attainment and of performance
Groups of key informants were recruited in each of 35 countries and answered a questionnaire about their own country's health system. The heads of the groups agreed that the results could not be used to compare one country with another, because no informant actually looked at any country but his or her own. “There was a unanimous agreement that the instrument was unsuitable in capturing information universally on the domains that were decided by the WHO. We were made to understand that this was a pilot study and findings of this attempt would enable identified issues to be included in some final survey with a representative sample of adequate number. But ranking countries based on this pilot study has been inappropriate and embarrassing.” (From Appendix 2.2; Protest from India to use responsiveness data).5
Five imputed values were published for responsiveness although actual responses were obtained. In two cases the informants gave opinions on one province (Shandong, China) or state (Andhra Pradesh, India) rather than the entire country. One part of such a vast country may not be representative of the whole; nonetheless, it surely represents the country better than an estimate based on 30 other countries. In three cases, the informants’ opinions were disregarded, without the excuse of incomplete information: Chile (rating improved), Mexico (lowered), and Sri Lanka (improved). Discussion Papers 21 (pages 22, 23, and 25)6 and 22 (page 9)7 give supposed reasons for replacing the informants’ evaluations. In two cases the justification is a health reform, and in the other a civil war. Neither situation was restricted to the countries named, nor was there any explanation why war or reform would make an imputed value more accurate than the opinion of well informed observers. Still less is it clear why health reform would make observers err in one direction in Chile but in the opposite direction in Mexico.
When I discovered this substitution, I wrote to Murray by e-mail on Aug 30, 2000, stating “if that doesn't qualify as manipulating the data, I don't know what does…At the very least, it gravely undermines the claim to be honest with the data and to report what we actually find.”8 Murray replied by e-mail the same day, saying that “if results from any survey lack face validity it would be rather counterproductive to simply go with them. It is the careful interplay of informed assessment of the quality of the results and empirical findings that is the hallmark of the development of good data systems.” I leave the reader to judge the “face validity” of that justification.
I regard these issues as not merely statistical or even political, but ethical. WHO insists that member governments should not misrepresent the data they send to the organisation. It is important that WHO publications meet the same standards. My efforts to persuade Frenk and Murray that publication of these numbers was unethical were in vain.
The question of whether the numbers are honest has an occasional comic aspect. The Russian Minister of Health, without knowing how the numbers were arrived at to rank Russia 130th among health systems, declared himself unperturbed by WHO's judgment, believing it an honest reflection of the situation.8
Use and interpretation of imputation
Imputation can lead to misleading results: a clear example occurs with fair financing. Regression analysis of correlates of this indicator show, not surprisingly, that high income inequality makes it hard to achieve fairness in paying for health care.9 But what is important is how well the system offsets or compensates for that handicap. A country with a high Gini coefficient that achieves a good distribution of finance burden deserves praise for overcoming income inequality, rather than being penalised. Imputed values can of course err in the opposite direction.10 Finally, when imputations are combined into an overall index, they must be clearly interpretable. There are seven components for level of responsiveness, with an equation for each, and an eighth equation for the distribution. When the equations used to impute these values are added up, the same variable sometimes pulls in different directions and enters in different forms, and it is impossible to make sense of any overall effect.
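The fair-financing point can be illustrated with a toy calculation. All numbers below are invented for illustration, and the simple one-variable regression stands in for WHO's actual imputation equations: a country that overcomes high income inequality to achieve a good measured score would nonetheless be assigned the lower value the regression predicts for its Gini coefficient.

```python
# Hypothetical data: (Gini coefficient, measured fair-financing score, 0-1)
# for six observed countries, lying on a downward-sloping line.
observed = [
    (0.30, 0.90), (0.35, 0.85), (0.40, 0.80),
    (0.45, 0.75), (0.50, 0.70), (0.55, 0.65),
]

# Ordinary least squares for score = a + b * gini, computed by hand.
n = len(observed)
mx = sum(g for g, _ in observed) / n
my = sum(f for _, f in observed) / n
b = (sum((g - mx) * (f - my) for g, f in observed)
     / sum((g - mx) ** 2 for g, _ in observed))
a = my - b * mx

# A country with high income inequality (Gini 0.55) whose system actually
# achieved a good distribution of the financing burden (measured 0.85)
# would, if imputed instead, be assigned the regression prediction.
measured = 0.85
imputed = a + b * 0.55

print(imputed < measured)  # True: imputation erases the achievement
```

The imputed value (0.65 in this toy example) penalises exactly the accomplishment the indicator is meant to reward.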
When governments composing WHO's Executive Board asked for more complete technical explanations in time for their meeting in January, 2001, they were given a collection of existing discussion papers, plus some new material.11 No material that explained the imputations was included.
Usefulness for health policy
The attainment and performance estimates are of no use for judging how well a health system performs. They illustrate the mathematical truth that the difference between two complex numbers can be entirely imaginary. This fact has not stopped WHO staff from attempting to explain differences in countries’ performance, as though the performance itself were real and accurately estimated.12 Such analysis amounts to guessing who built the canals on Mars. To discover that richer countries have more responsive systems is of little help to a poor country, nor is much gained from knowing inequality in child mortality is lower when total child deaths are fewer. What a concerned government wants to know is what it can do about systemic failings.
One reason the results are of no use for policy is that WHO decision-makers avoided any real participation by governments. There were consultative meetings in December, 1999, and January, 2000, attended by academic experts and staff from WHO headquarters and regional offices. Although the WHO framework was explained and the indicators described in general terms, participants were not told how any numbers were calculated, nor about the intention to publish imputed values. So far as I can establish, nothing said by participants at either meeting changed anything about the methods or numbers. Subsequent requests for information, even from ministers of health, were not answered (José Serra, Minister of Health for Brazil, in a letter to Gro Harlem Brundtland, Director-General of WHO, July 5, 2000). WHO staff insisted “We should not underestimate the intelligence of policy makers”,13 but in my view, the organisation did not respect the intelligence of those policy makers when defining and calculating the indicators and failing to explain them adequately. Murray also exaggerated the novelty of ideas and indicators in the report. He accused William Hsiao of Harvard University of plagiarising the WHO framework in a paper that discussed health-system goals,14 necessitating a refutation by Hsiao in an e-mail sent to WHO on Nov 27, 2000.
Why did WHO decision-makers proceed as they did?
Given the scientific and ethical objections, to say nothing of the political risks, the question arises as to why WHO took the course that it did. I was told in several conversations with Frenk, Murray, or both, why they thought WHO had to publish an index of performance, even with no consultation with the governments and with most of the numbers imputed. These reasons included:
The favourable opinion of Amartya Sen, who led the creation of the human development index (HDI). The HDI is not intended to establish a frontier of what countries ought to achieve, so although it has some of the same deficiencies as the WHO index, in that respect it is quite different.
The supposition that no-one was adequately concerned with health-system problems, thus a ranking was needed to call attention to them. This approach gives no credit to those who have been working on health-system reform in many countries for many years.
The assertion that nobody would have paid attention to an incomplete analysis restricted to real data. Possibly a more evidence-based analysis would have attracted less notice, because policy-makers pay too much attention to scorecards—partly because WHO and other organisations push them to do so.
The claim that the estimates, imputations, shortcuts, etc used to fill in the tables were better than any previous ones, such as those in the World Development Report 199315 or the Global Burden of Disease.16
Finally, WHO had to produce rankings in time for the 2000 report: any such urgency did not arise from the nature of the exercise or the needs of member states.
The frontier of the possible
The attempt to measure performance depends on a production function for attainments that is available to all health systems. The most striking result is that performance scores are correlated with health expenditure per person (figure 2.6 in the report).1 At expenditures below US$100, half the countries receive scores of 50% or less; poor performance is rare at higher spending levels. This result does not imply a minimum level of spending for a well functioning system, despite efforts by WHO staff to interpret it that way in Health System Performance: Concepts, Measurement and Determinants (pages 4–5).12 If a supposed frontier of what is attainable passes through a large area where no observations fall close to it and there is no assurance that countries actually could get closer to the frontier, the prima facie evidence is that the frontier has been drawn too far away.
Most countries that seem to perform poorly—all the lowest-ranked 18, and 33 of the lowest-ranking 36—are in sub-Saharan Africa. “A large part of the explanation is the HIV/AIDS epidemic” (page 43 of the report);1 there is no way to distinguish, however, between two interpretations: that AIDS is making it hard to reach the frontier, or that it has moved the frontier downward, reducing the best attainable health status. Have African countries fallen toward the floor, or has the roof collapsed on them?
The second interpretation means that deaths and disruption from AIDS have impaired African societies’ capacities to deal with pre-existing diseases. Under the impact of the epidemic, money and education have become less effective in improving health. The first interpretation implies that African governments could and should control the health damage from AIDS with existing levels of expenditure and schooling. What is missing is any sense of the feasibility of controlling AIDS. Some preventive interventions have contributed to slowing the trend toward higher incidence;17 but there is no guarantee that control is feasible everywhere. Beyond the issue of interventions required for control is the question of how much they would cost. Estimates suggest that in sub-Saharan Africa, prevention, care, and antiretroviral therapy would require incremental spending, respectively, of at least $1.17 billion, $1.05 billion, and $0.72 billion yearly by 2007, and two, four, and eight times those amounts by 2015 (Improving Health Outcomes of the Poor, tables A.9–A.11),18 to keep prevalence from increasing. These estimates are equivalent to saying the frontier has moved down because of AIDS.
What money can accomplish depends on knowledge and on how money has been spent in the past. Two countries with the same educational level (measured as average number of years of education attained by the population over the age of 15 years), health expenditure, and other variables, but different recent histories, will not have identical possibilities for progress. Extra money or knowledge takes time to become effective. The time lag differs among diseases because scaling up requires more investment in infrastructure, training or other inputs for some than for others (Improving Health Outcomes of the Poor, page 86).18 In consequence, a frontier of the possible cannot be defined independently of the disease burden and the interval in which a country is supposed to improve.
Where to go from here?
Performance measurement relative to what health systems should be able to achieve is a chimera, at least in the highly aggregated, top-down fashion that WHO decision-makers have pursued. Measurement of attainments of various kinds is still valuable, as is looking for ways that health systems can achieve more and make better use of resources. Breaking down performance along one or more dimensions seems to be a good approach—for example, assessment of how the hospital subsystem functions and contributes to performance. Determining which outcomes to hold hospitals responsible for is complex; outcomes may look bad because primary care does not do its job properly. Getting those assessments right directs attention to specific areas, organisations, and policies. If that process cannot be done in reasonable time and at reasonable cost, it is better to abandon the performance measurement exercise and devote resources to areas of more immediate value to the people for whose benefit health systems exist.
Conflict of interest statement
During Sept 1, 1999, to Aug 22, 2001, I was seconded from the World Bank to WHO, where I served as editor-in-chief of The World Health Report 2000—Health Systems: Improving Performance. The interpretations and conclusions expressed in this paper are entirely my own. They do not necessarily represent the views of the World Bank, its executive directors, the countries they represent, WHO, its executive directors, or its member states. The paper was written after leaving WHO to return to the World Bank. There is no other possible conflict of interest. A full version of this article can be seen at http://image.thelancet.com/extras/02art2029webversion.pdf