STANFORD INSTITUTE FOR ECONOMIC POLICY RESEARCH

Selection with Variation in Diagnostic Skill: Evidence from Radiologists

David C. Chan (Stanford University, Department of Veterans Affairs, and NBER)
Matthew Gentzkow (Stanford University and NBER)
Chuan Yu (Stanford University)

September 2021
Working Paper No. 21-057

SELECTION WITH VARIATION IN DIAGNOSTIC SKILL: EVIDENCE FROM RADIOLOGISTS

David C. Chan, Matthew Gentzkow, Chuan Yu*

September 2021

Abstract

Physicians, judges, teachers, and agents in many other settings differ systematically in the decisions they make when faced with similar cases. Standard approaches to interpreting and exploiting such differences assume they arise solely from variation in preferences. We develop an alternative framework that allows variation in both preferences and diagnostic skill, and show that both dimensions may be partially identified in standard settings under quasi-random assignment. We apply this framework to study pneumonia diagnoses by radiologists. Diagnosis rates vary widely among radiologists, and descriptive evidence suggests that a large component of this variation is due to differences in diagnostic skill. Our estimated model suggests that radiologists view failing to diagnose a patient with pneumonia as more costly than incorrectly diagnosing one without, and that this leads less-skilled radiologists to optimally choose lower diagnostic thresholds. Variation in skill can explain 39 percent of the variation in diagnostic decisions, and policies that improve skill perform better than uniform decision guidelines. Failing to account for skill variation can lead to highly misleading results in research designs that use agent assignments as instruments.
JEL Codes: I1, C26, J24, D81
Keywords: selection, skill, diagnosis, judges design, monotonicity

*We thank Hanming Fang, Amy Finkelstein, Alex Frankel, Martin Hackmann, Nathan Hendren, Peter Hull, Karam Kang, Pat Kline, Jon Kolstad, Pierre-Thomas Leger, Jesse Shapiro, Gaurav Sood, Chris Walters, and numerous seminar and conference participants for helpful comments and suggestions. We also thank Zong Huang, Vidushi Jayathilak, Kevin Kloiber, Douglas Laporte, Uyseok Lee, Christopher Lim, Lisa Yi, and Saam Zahedian for excellent research assistance. The Stanford Institute for Economic Policy Research provided generous funding and support. Chan gratefully acknowledges support from NIH DP5OD019903-01.

1 Introduction

In a wide range of settings, agents facing similar problems make systematically different choices. Physicians differ in their propensity to choose aggressive treatments or order expensive tests, even when facing observably similar patients (Chandra et al. 2011; Van Parys and Skinner 2016; Molitor 2017). Judges differ in their propensity to hand down strict or lenient sentences, even when facing observably similar defendants (Kleinberg et al. 2018). Similar patterns hold for teachers, managers, and police officers (Bertrand and Schoar 2003; Figlio and Lucas 2004; Anwar and Fang 2006). Such variation is of interest both because it implies differences in resource allocation across similar cases and because it has increasingly been exploited in research designs using agent assignments as a source of quasi-random variation (e.g., Kling 2006).

In all such settings, we can think of the decision process in two steps. First, there is an evaluation step in which decision-makers assess the likely effects of the possible decisions given the case before them. Physicians seek to diagnose a patient's underlying condition and assess the potential effects of treatment, judges seek to determine the facts of a crime and the likelihood of recidivism, and so on.
We refer to the accuracy of these assessments as an agent's diagnostic skill. Second, there is a selection step in which the decision-maker decides what preference weights to apply to the various costs and benefits in determining the decision. We refer to these weights as an agent's preferences. In a stylized case of a binary decision d ∈ {0, 1}, we can think of the first step as ranking cases in terms of their appropriateness for d = 1 and the second step as choosing a cutoff in this ranking. While systematic variation in decisions could in principle come from either skill or preferences, a large part of the prior literature we discuss below assumes that agents differ only in the latter. This matters for the welfare evaluation of practice variation, as variation in preferences would suggest inefficiency relative to a social planner's preferred decision rule whereas variation in skill need not. It matters for the types of policies that are most likely to improve welfare, as uniform decision guidelines may be effective in the face of varying preferences but counterproductive in the face of varying skill. And, as we show below, it matters for research designs that use agents' decision rates as a source of identifying variation, as variation in skill will typically lead the key monotonicity assumption in such designs to be violated. In this paper, we introduce a framework to separate heterogeneity in skill and preferences when cases are quasi-randomly assigned, and apply it to study heterogeneity in pneumonia diagnoses made by radiologists. Pneumonia affects 450 million people and causes 4 million deaths every year worldwide (Ruuskanen et al. 2011). While it is more common and deadly in the developing world, it remains the eighth leading cause of death in the US, despite the availability of antibiotic treatment (Kung et al. 2008; File and Marrie 2010). Our framework starts with a classification problem in which both decisions and underlying states are binary.
As in the standard one-sided selection model, the outcome only reveals the true state conditional on one of the two decisions. In our setting, the decision is whether to diagnose a patient and treat her with antibiotics, the state is whether the patient has pneumonia, and the state is only observed if the patient is not treated, since once a patient is given antibiotics it is often impossible to tell whether she actually had pneumonia or not. We refer to the share of a radiologist's patients diagnosed with pneumonia as her diagnosis rate. We refer to the share of patients who leave with undiagnosed pneumonia, i.e., the share of patients who are false negatives, as her miss rate. We draw close connections between two representations of agent decisions in this setting: (i) the reduced-form relationship between diagnosis and miss rates, which we observe directly in our data; and (ii) the relationship between true and false positive rates, commonly known as the receiver operating characteristic (ROC) curve. The ROC curve has a natural economic interpretation as a production possibilities frontier for "true positive" and "true negative" diagnoses. This framework thus maps skill and preferences to respective concepts of productive and allocative efficiency. Using Veterans Health Administration (VHA) data on 5.5 million chest X-rays in the emergency department (ED), we examine variation in diagnostic decisions and outcomes related to pneumonia across radiologists who are assigned imaging cases in a quasi-random fashion. We measure miss rates by the share of a radiologist's patients who are not diagnosed in the ED yet return with a pneumonia diagnosis in the next 10 days. We begin by demonstrating significant variation in both diagnosis and miss rates across radiologists. Reassigning patients from a radiologist in the 10th percentile of diagnosis rates to a radiologist in the 90th percentile would increase the probability of a diagnosis from 8.9 percent to 12.3 percent.
Reassigning patients from a radiologist in the 10th percentile of miss rates to a radiologist in the 90th percentile would increase the probability of a false negative from 0.2 percent to 1.8 percent. These findings are consistent with prior evidence documenting variability in the diagnosis of pneumonia based on the same chest X-rays, both across and within radiologists (Abujudeh et al. 2010; Self et al. 2013). We then turn to the relationship between diagnosis and miss rates. At odds with the prediction of a standard model with no skill variation, we find that radiologists who diagnose at higher rates actually have higher rather than lower miss rates. A patient assigned to a radiologist with a higher diagnosis rate is more likely to go home with untreated pneumonia than one assigned to a radiologist with a lower diagnosis rate. This fact alone rejects the hypothesis that all radiologists operate on the same production possibilities frontier, and it suggests a large role for variation in skill. In addition, we find that there is substantial variation in the probability of false negatives conditional on diagnosis rate. For the same diagnosis rate, a radiologist in the 90th percentile of miss rates has a miss rate 0.7 percentage points higher than that of a radiologist in the 10th percentile. This evidence suggests that interpreting our data through a standard model that ignores skill could be highly misleading. At a minimum, it means that policies that focus on harmonizing diagnosis rates could miss important gains in improving skill. Moreover, such policies could be counter-productive if skill variation makes varying diagnosis rates optimal. If missing a diagnosis (a false negative) is more costly than falsely diagnosing a healthy patient (a false positive), a radiologist with noisier diagnostic information (less skill) may optimally diagnose more patients; requiring her to do otherwise could reduce efficiency. 
Finally, a standard research design that uses the assignment of radiologists as an instrument for pneumonia diagnosis would fail badly in this setting. We show that our reduced-form facts strongly reject the monotonicity conditions necessary for such a design. Applying the standard approach would yield the nonsensical conclusion that diagnosing a patient with pneumonia (and thus giving her antibiotics) makes her more likely to return to the emergency room with pneumonia in the near future. We show that, under quasi-random assignment of patients to radiologists, the joint distribution of diagnosis rates and miss rates can be used to identify partial orderings of skill among the radiologists. The intuition is simple: In any pair of radiologists, a radiologist that has both a higher diagnosis rate and a higher miss rate than the other radiologist must be lower-skilled. Similarly, a radiologist that has a lower or equal diagnosis rate but a higher miss rate, by a difference exceeding any difference in diagnosis rates, must also be lower-skilled. In the final part of the paper, we estimate a structural model of diagnostic decisions to permit a more precise characterization of these facts. Following our conceptual framework, radiologists first evaluate chest X-rays to form a signal of the underlying disease state and then select cases with signals above a certain threshold to diagnose with pneumonia. Undiagnosed patients who in fact have pneumonia will eventually develop clear symptoms, thus revealing false negative diagnoses. But among cases receiving a diagnosis, those who truly have pneumonia cannot be distinguished from those who do not. Radiologists may vary in their diagnostic accuracy, and each radiologist endogenously chooses a threshold selection rule in order to maximize utility. Radiologist utility depends on false negative and false positive diagnoses, and the relative utility weighting of these outcomes may vary across radiologists. 
We find that the average radiologist receives a signal that has a correlation of 0.85 with the patient's underlying latent state, but that diagnostic accuracy varies widely, from a correlation with the latent state of 0.76 in the 10th percentile of radiologists to 0.93 in the 90th percentile. The disutility of missing a diagnosis is on average 6.71 times as high as that of an unnecessary diagnosis; this ratio varies from 5.60 to 7.91 between the 10th and 90th radiologist percentiles. Overall, 39 percent of the variation in decisions and 78 percent of the variation in outcomes can be explained by variation in skill. We then consider the welfare implications of counterfactual policies. While eliminating variation in diagnosis rates always improves welfare under the (incorrect) assumption of uniform diagnostic skill, we show that this policy may actually reduce welfare. In contrast, increasing diagnostic accuracy can yield much larger welfare gains. Finally, we document how diagnostic skill varies across groups of radiologists. Older radiologists and radiologists with higher chest X-ray volume have higher diagnostic skill. Higher-skilled radiologists tend to issue shorter reports of their findings but spend more time generating those reports, suggesting that effort (rather than raw talent alone) may contribute to radiologist skill. Aversion to false negatives tends to be negatively related to radiologist skill. Our strategy for identifying causal effects relies on quasi-random assignment of cases to radiologists. This assumption is particularly plausible in our ED setting because of idiosyncratic variation in the arrival of patients and the availability of radiologists conditional on time and location controls. To support this assumption, we show that a rich vector of patient characteristics that are strongly related to false negatives has limited predictive power for radiologist assignment.
Comparing radiologists with high and low propensity to diagnose, we see statistically significant but economically small imbalance in patient characteristics in our full sample of stations, and negligible imbalance in a subset of stations selected for balanced assignment on a single characteristic, patient age. We show further that our main results are stable in this latter sample of stations, and robust to adding or removing controls for patient characteristics. Our findings relate most directly to a large and influential literature on practice variation in health care (Fisher et al. 2003a,b; Institute of Medicine 2013). This literature has robustly documented variation in spending and treatment decisions that has little correlation with patient outcomes. The seeming implication of this finding is that spending in health care provides little benefit to patients (Garber and Skinner 2008), a provocative hypothesis that has spurred an active body of research seeking to use natural experiments to identify the causal effect of spending (e.g., Doyle et al. 2015). In this paper, we build on Chandra and Staiger (2007) in investigating the possibility of heterogeneous productivity (e.g., physician skill) as an alternative explanation.¹ By exploiting the joint distribution of decisions and outcomes, we find significant variation in productivity, which rationalizes a large share of the variation in diagnostic decisions. The same mechanism may explain the weak relationship between decision rates and outcomes observed in other settings.² Perhaps most closely related to our paper are evaluations by Abaluck et al. (2016) and Currie and MacLeod (2017), both of which examine diagnostic decision-making in health care. Abaluck et al. (2016) assume that physicians have the same diagnostic skill (i.e., the same ranking of cases) but may differ in where they set their thresholds for diagnosis.
Currie and MacLeod (2017) assume that physicians have the same preferences but may differ in skill. Also related to our paper is a recent study of hospitals by Chandra and Staiger (2020), who allow for comparative advantage and different thresholds for treatment. In their model, the potential outcomes of treatment may differ across hospitals, but hospitals are equally skilled in ranking patients according to their potential outcomes.³ Relative to these papers, a key difference of our study is that we use quasi-random assignment of cases to providers. More broadly, our work contributes to the health literature on diagnostic accuracy. While mostly descriptive, this literature suggests large welfare implications from diagnostic errors (Institute of Medicine 2015). Diagnostic errors account for 7 to 17 percent of adverse events in hospitals (Leape et al. 1991; Thomas et al. 2000). Postmortem examination research suggests that diagnostic errors contribute to 9 percent of patient deaths (Shojania et al. 2003). Finally, our paper contributes to the "judges-design" literature, which estimates treatment effects by exploiting quasi-random assignment to agents with different treatment propensities (e.g., Kling

¹Doyle et al. (2010) show a potential relationship between physician human capital and resource utilization decisions. Gowrisankaran et al. (2017) and Ribers and Ullrich (2019) both provide evidence of variation in diagnostic and treatment skill, and Silver (2020) examines returns to time spent on patients by ED physicians and variation in the physicians' productivity. Mullainathan and Obermeyer (2019) show evidence of poor heart attack decisions (low skill) evaluated by a machine learning benchmark. Stern and Trajtenberg (1998) study variation in prescribing and suggest that some of it may relate to physicians' diagnostic skill.

²For example, Kleinberg et al.
(2018) find that the increase in crime associated with judges who are more likely to release defendants on bail is about the same as if these more lenient judges randomly picked the extra defendants to release on bail. Arnold et al. (2018) find a similar relationship for black defendants being released on bail. Judges who are most likely to release defendants on bail in fact have slightly lower crime rates than judges who are less likely to grant bail. As in our setting, policy implications in these other settings will depend on the relationship between agent skill and preferences (see, e.g., Hoffman et al. 2018; Frankel 2021).

³Under this assumption, a sensible implication is that hospitals with comparative advantage for treatment should treat more patients. Interestingly, however, our work suggests that if comparative advantage (i.e., higher treatment effects on the treated) is microfounded on better diagnostic skill, then hospitals with such comparative advantage may instead optimally treat fewer patients.

2006). We show how variation in skill relates to the standard monotonicity assumption in the literature, which requires that all agents order cases in the same way but may draw different thresholds for treatment (Imbens and Angrist 1994; Vytlacil 2002). Monotonicity can thus only hold if all agents have the same skill. Our empirical insight that we can test and quantify violations of monotonicity (or variation in skill) relates to conceptual work that exploits bounds on potential outcome distributions (Kitagawa 2015; Mourifié and Wan 2017) as well as more recent work to test instrument validity in the judges design (Frandsen et al. 2019) and to detect inconsistency in judicial decisions (Norris 2019).⁴ Our identification results and modeling framework are closely related to the contemporaneous work of Arnold et al. (2020), who study racial bias in bail decisions. The remainder of this paper proceeds as follows.
Section 2 sets up a high-level empirical framework for our analysis. Section 3 describes the setting and data. Section 4 presents our reduced-form analysis, with the key finding that radiologists who diagnose more cases also miss more cases of pneumonia. Section 5 presents our structural analysis, separating radiologist diagnostic skill from preferences. Section 6 considers policy counterfactuals. Section 7 concludes. All appendix material is in the online appendix.

2 Empirical Framework

2.1 Setup

We consider a population of agents j and cases i, with j(i) denoting the agent assigned case i. Agent j makes a binary decision d_ij ∈ {0, 1} for each assigned case (e.g., not treat or treat, acquit or convict). The goal is to align the decision with a binary state s_i ∈ {0, 1} (e.g., healthy or sick, innocent or guilty). The agent does not observe s_i directly but observes a realization w_ij ∈ ℝ of a signal with distribution F_j(·|s_i) ∈ Δ(ℝ) that may be informative about s_i, and she chooses d_ij based only on this signal. This setup is the well-known problem of statistical classification.

For agent j, we can define the probabilities of four outcomes (Panel A of Figure I): true positives, TP_j = Pr(d_ij = 1, s_i = 1); false positives (type I errors), FP_j = Pr(d_ij = 1, s_i = 0); true negatives, TN_j = Pr(d_ij = 0, s_i = 0); and false negatives (type II errors), FN_j = Pr(d_ij = 0, s_i = 1). P_j = TP_j + FP_j denotes the expected proportion of cases j classifies as positive, and S_j = TP_j + FN_j denotes the prevalence of s_i = 1 in j's population of cases. We refer to P_j as j's diagnosis rate, and we refer to FN_j as her miss rate.

Each agent maximizes a utility function u_j(d, s) with u_j(1, 1) > u_j(0, 1) and u_j(0, 0) > u_j(1, 0). We assume without loss of generality that the posterior probability of s_i = 1 is increasing in w_ij, so that any optimal decision rule can be represented by a threshold τ_j with d_ij = 1 if and only if w_ij ≥ τ_j.

We define agents' skill based on the Blackwell (1953) informativeness of their signals. Agent j is (weakly) more skilled than j' if and only if F_j is (weakly) more Blackwell-informative than F_j'. By the definition of Blackwell informativeness, this will be true if either of two equivalent conditions holds: (i) for any arbitrary utility function u(d, s), ex ante expected utility from an optimal decision based on observing a draw from F_j is greater than from an optimal decision based on observing a draw from F_j'; (ii) F_j' can be produced by combining a draw from F_j with random noise uncorrelated with s_i. We say that two agents have the same skill if their signals are equal in the Blackwell ordering, and we say that skill is uniform if all agents have equal skill. The Blackwell ordering is incomplete in general, and it is possible that agent j is neither more nor less skilled than j'. This could happen, for example, if F_j is relatively more accurate in state s = 0 while F_j' is relatively more accurate in state s = 1. In the case in which all agents can be ranked by skill, we can associate each agent with an index of skill α ∈ ℝ, where j is more skilled than j' if and only if α_j ≥ α_j'.

⁴Kitagawa (2015) and Mourifié and Wan (2017) develop tests of instrument validity based on an older insight in the literature noting that instrument validity implies non-negative densities of compliers for any potential outcome (Imbens and Rubin 1997; Balke and Pearl 1997; Heckman and Vytlacil 2005). Recent work by Machado et al. (2019) also exploits bounds on a binary outcome to test instrument validity and to sign average treatment effects. Similar to Frandsen et al. (2019), we define a monotonicity condition in the judges design that is weaker than the standard one considered in these papers. However, we demonstrate a test that is stronger than the standard in the judges-design literature.
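As a concrete illustration of this setup, the four outcome probabilities and the threshold rule can be simulated. This is a minimal sketch under assumed parameters (a Gaussian signal, 10 percent prevalence, and an arbitrary cutoff), not the paper's empirical model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
S = 0.10                        # assumed prevalence Pr(s_i = 1)
s = rng.random(n) < S           # binary state s_i

# Illustrative signal: w | s=0 ~ N(0,1), w | s=1 ~ N(d_skill,1).
# A larger d_skill means a more Blackwell-informative signal.
d_skill = 1.5
w = rng.normal(0.0, 1.0, n) + d_skill * s

tau = 1.0                       # agent-specific threshold
d = w >= tau                    # decision rule d_ij = 1(w_ij >= tau_j)

TP = np.mean(d & s)             # Pr(d = 1, s = 1)
FP = np.mean(d & ~s)            # Pr(d = 1, s = 0)
TN = np.mean(~d & ~s)           # Pr(d = 0, s = 0)
FN = np.mean(~d & s)            # Pr(d = 0, s = 1)

P = TP + FP                     # diagnosis rate P_j
assert abs(TP + FN - np.mean(s)) < 1e-9   # TP_j + FN_j equals prevalence S_j
assert abs(TP + FP + TN + FN - 1) < 1e-9  # the four cells partition all cases
```

Raising `tau` lowers both the diagnosis rate `P` and the false positive rate while raising the miss rate `FN`; raising `d_skill` lowers `FN` at any given `P`.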
2.2 ROC Curves

A standard way to summarize the accuracy of classification is in terms of the receiver operating characteristic (ROC) curve. This plots the true positive rate, TPR_j = Pr(d_ij = 1 | s_i = 1) = TP_j / (TP_j + FN_j), against the false positive rate, FPR_j = Pr(d_ij = 1 | s_i = 0) = FP_j / (FP_j + TN_j), with the curve for a particular signal F_j indicating the set of all (FPR_j, TPR_j) that can be produced by a decision rule of the form d_ij = 1(w_ij ≥ τ_j) for some τ_j. Panel B in Figure I shows several possible ROC curves.

In the context of our model, the ROC curve of agent j represents the frontier of potential classification outcomes she can achieve as she varies the proportion of cases P_j she classifies as positive. If the agent diagnoses no cases (τ_j = ∞), she will have TPR_j = 0 and FPR_j = 0. If she diagnoses all cases (τ_j = −∞), she will have TPR_j = 1 and FPR_j = 1. As she increases P_j (decreases τ_j), both TPR_j and FPR_j must weakly increase. The ROC curve thus reveals a technological tradeoff between the "sensitivity" (or TPR_j) and "specificity" (or 1 − FPR_j) of classification. It is straightforward to show that in our model, where the likelihood of s_i = 1 is monotonic in w_ij, the ROC curves give the maximum TPR_j achievable for each FPR_j, and they not only must be increasing but also must be concave and lie above the 45-degree line.⁵ If agent j is more skilled than agent j', any (FPR, TPR) pair achievable by j' is also achievable by j. This follows immediately from the definition of Blackwell informativeness, as j can always reproduce the signal of j' by adding random noise.

Remark 1. Agent j has higher skill than j' if and only if the ROC curve of agent j lies everywhere weakly above the ROC curve of agent j'. Agents j and j' have equal skill if and only if their ROC curves are identical.

The classification framework is closely linked with the standard economic framework of production. An ROC curve can be viewed as a production possibilities frontier of TPR_j and 1 − FPR_j.
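The dominance property in Remark 1 can be checked numerically under a simple binormal signal model, an illustrative assumption rather than the paper's empirical specification: w | s=0 ~ N(0,1) and w | s=1 ~ N(d,1), where a larger d means a more informative signal. Sweeping the threshold τ traces out each agent's ROC curve:

```python
import math

def Phi(x):
    # Standard normal CDF via the error function (stdlib only)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def roc(d_skill, taus):
    # For threshold tau: FPR = Phi(-tau), TPR = Phi(d_skill - tau).
    # The list of (FPR, TPR) pairs traces the agent's ROC curve.
    return [(Phi(-t), Phi(d_skill - t)) for t in taus]

taus = [x / 10 for x in range(-40, 41)]
low, high = roc(1.0, taus), roc(2.0, taus)

# Remark 1: the more skilled agent's curve lies weakly above the other's,
# and both lie weakly above the 45-degree line (TPR >= FPR).
for (fpr_l, tpr_l), (fpr_h, tpr_h) in zip(low, high):
    assert abs(fpr_l - fpr_h) < 1e-12   # same tau gives the same FPR here
    assert tpr_h >= tpr_l >= fpr_l
```

The same sweep also exhibits the tradeoff described above: as τ falls, both TPR and FPR weakly increase along each curve.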
Agents on higher ROC curves are more productive (i.e., more skilled) in the evaluation stage. Where an agent chooses to locate on an ROC curve depends on her preferences, or the tangency between the ROC curve and an indifference curve. It is possible that agents differ in preferences but not skill, so that they lie along identical ROC curves, and we would observe a positive correlation between TPR_j and FPR_j across j. It is also possible that they differ in skill but not preferences, so that they lie at the tangency point on different ROC curves, and we could observe a negative correlation between TPR_j and FPR_j across j. Figure II illustrates these two cases with hypothetical data on the joint distribution of decisions and outcomes. This figure suggests some intuition, which we will formalize later, for how skill and preferences may be separately identified.

In the empirical analysis below, we will visualize the data in two spaces. The first is the ROC space of Figure II. The second is a plot of miss rates FN_j against diagnosis rates P_j, which we refer to as "reduced-form space." When cases are randomly assigned, so that S_j is the same for all j, there exists a one-to-one correspondence between these two ways of looking at the data, and the slope relating FN_j to P_j in reduced-form space provides a direct test of uniform skill.⁶

Remark 2. Suppose S_j = Pr(s_i = 1 | j(i) = j) is equal to a constant S for all j. Then for any two agents j and j',

1. (TPR_j, FPR_j) = (TPR_j', FPR_j') if and only if (FN_j, P_j) = (FN_j', P_j').

2. If the agents have equal skill and P_j ≠ P_j', then (FN_j − FN_j') / (P_j − P_j') ∈ [−1, 0].

⁵Concavity follows from observing that if (FPR, TPR) and (FPR', TPR') are two points on an agent's ROC curve generated by using thresholds τ and τ', the agent can also achieve any convex combination of these points by randomizing between τ and τ'. That the ROC curve must lie weakly above the 45-degree line follows from noting that for any FPR an agent can achieve TPR = FPR by ignoring her signal and choosing d = 1 with probability equal to FPR. The maximum achievable TPR associated with this FPR must therefore be weakly larger.

⁶The two facts in Remark 2 are immediate from the observation that FN_j = S_j (1 − TPR_j) and P_j = S_j · TPR_j + (1 − S_j) · FPR_j combined with the fact that ROC curves are increasing.

2.3 Potential Outcomes and the Judges Design

When there is an outcome of interest y_ij = y_i(d_ij) that only depends on the agent's decision d_ij, we can map our classification framework to the potential outcomes framework with heterogeneous treatment effects (Rubin 1974; Imbens and Angrist 1994). The object of interest is some average of the treatment effects y_i(1) − y_i(0) across individuals. We observe case i assigned to only one agent j, which we denote as j(i), so the identification challenge is that we only observe d_i = Σ_j 1(j = j(i)) d_ij and y_i = Σ_j 1(j = j(i)) y_ij = y_i(d_i) corresponding to j = j(i).

A growing literature starting with Kling (2006) has proposed using heterogeneous decision propensities of agents to identify these average treatment effects in settings where cases i are randomly assigned to agents j with different propensities of treatment. This empirical structure is popularly known as the "judges design," referring to early applications in settings with judges as agents. The literature typically assumes conditions of instrumental variable (IV) validity from Imbens and Angrist (1994).⁷ This guarantees that an IV regression of y_i on d_i, instrumenting for the latter with indicators for the assigned agent, recovers a consistent estimate of the local average treatment effect (LATE).

Condition 1 (IV Validity). Consider the potential outcome y_ij and the treatment response indicator d_ij ∈ {0, 1} for case i and agent j. For a set of two or more agents j, and a random sample of cases i, the following conditions hold:

(i) Exclusion: y_ij = y_i(d_ij) with probability 1.

(ii) Independence: (y_i(0), y_i(1), d_ij) is independent of the assigned agent j(i).
(iii) Strict Monotonicity: For any j and j', either d_ij ≥ d_ij' for all i, or d_ij ≤ d_ij' for all i, with probability 1.

Vytlacil (2002) shows that Condition 1(iii) is equivalent to all agents ordering cases by the same latent index w_i and then choosing d_ij = 1(w_i ≥ τ_j), where τ_j is an agent-specific cutoff. Note that this implies that the data must be consistent with all agents having the same signals and thus the same skill. An agent with a lower cutoff must have a weakly higher rate of both true and false positives. Condition 1 thus greatly restricts the pattern of outcomes in the classification framework.

Remark 3. Suppose Condition 1 holds. Then the observed data must be consistent with all agents having uniform skill. By Remark 2, for any two agents j and j', we must have (FN_j − FN_j') / (P_j − P_j') ∈ [−1, 0].

⁷In addition to the assumptions below, we also require instrument relevance, such that Pr(d_ij = 1) ≠ Pr(d_ij' = 1) for some j and j'. This requirement can be assessed by a first-stage regression of d_i on judge indicators.

This implication is consistent with prior work on IV validity (Balke and Pearl 1997; Heckman and Vytlacil 2005; Kitagawa 2015). If we define y_ij to be an indicator for a false negative and consider a binary instrument defined by assignment to either j or j', Equation (1.1) of Kitagawa (2015) directly implies Remark 3. An additional intuition is that under Condition 1, for any outcome y_ij, the Wald estimand comparing a population of cases assigned to agents j and j' is (Ȳ_j − Ȳ_j') / (P_j − P_j') = E[y_i(1) − y_i(0) | d_ij > d_ij'], where Ȳ_j is the average of y_i among cases assigned to agent j (Imbens and Angrist 1994). If we define y_i to be an indicator for a false negative, the Wald estimand lies in [−1, 0], since y_i(1) − y_i(0) ∈ {−1, 0}.

By Remark 3, strict monotonicity in Condition 1(iii) of the judges design implies uniform skill. The converse is not true, however. Agents with uniform skill may yet violate strict monotonicity.
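Both directions of this logic can be illustrated with a small numerical sketch under a binormal signal model; all parameter values below are assumptions for illustration, not the paper's estimates. When agents share one skill level and differ only in thresholds, the slope of miss rates on diagnosis rates stays in [−1, 0], as Remark 3 requires; when lower-skilled agents also choose lower thresholds, the slope turns positive:

```python
import math

def Phi(x):
    # Standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

S = 0.10  # assumed prevalence Pr(s_i = 1)

def rates(d_skill, tau):
    # Binormal model: w|s=0 ~ N(0,1), w|s=1 ~ N(d_skill,1), decision w >= tau
    TPR, FPR = Phi(d_skill - tau), Phi(-tau)
    P = S * TPR + (1 - S) * FPR        # diagnosis rate P_j
    FN = S * (1 - TPR)                 # miss rate FN_j
    return P, FN

def slope(points):
    # OLS slope of FN_j on P_j across agents
    n = len(points)
    mP = sum(p for p, _ in points) / n
    mF = sum(f for _, f in points) / n
    num = sum((p - mP) * (f - mF) for p, f in points)
    den = sum((p - mP) ** 2 for p, _ in points)
    return num / den

# Uniform skill, thresholds varying: slope lies in [-1, 0]
uniform = [rates(1.5, t / 10) for t in range(-20, 21)]
assert -1.0 <= slope(uniform) <= 0.0

# Skill correlated with thresholds (less skill -> lower threshold):
# higher diagnosis rates pair with higher miss rates, so the slope is
# positive and the monotonicity restriction fails
correlated = [rates(1.0 + 0.1 * k, 1.2 + 0.08 * k) for k in range(11)]
assert slope(correlated) > 0.0
```

The second case previews the paper's empirical finding: a positive reduced-form slope cannot arise if all agents sit on a single ROC curve.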
For example, if their signals are drawn independently from the same distribution, they might order different cases differently by random chance. One might ask whether a condition weaker than strict monotonicity might be both consistent with our data and sufficient for the judges design to recover a well-defined LATE. Frandsen et al. (2019) introduce one such condition, which they call "average monotonicity." This requires that the covariance between agents' average treatment propensities and their potential treatment decisions for each case i be positive. To define the condition formally, let p_j be the share of cases assigned to agent j, let P̄ = Σ_j p_j P_j be the p-weighted average treatment propensity, and let d̄_i = Σ_j p_j d_ij be the p-weighted average potential treatment of case i.

Condition 2 (Average Monotonicity). For all i, Σ_j p_j (P_j − P̄)(d_ij − d̄_i) ≥ 0.

Frandsen et al. (2019) show that Condition 2, in place of Condition 1(iii), is sufficient for the judges design to recover a well-defined LATE. We note two more-primitive conditions that are each sufficient for average monotonicity. One is that the probability that j diagnoses patient i is either weakly higher or weakly lower than the probability that j' diagnoses patient i, for all i. The other is that variation in skill is orthogonal to the diagnosis rate in a large population of agents.

Condition 3 (Probabilistic Monotonicity). For any j and j', Pr(d_ij = 1) ≥ Pr(d_ij' = 1) for all i, or Pr(d_ij = 1) ≤ Pr(d_ij' = 1) for all i.

Condition 4 (Skill-Propensity Independence). (i) All agents can be ranked by skill and we associate each agent with an index α_j such that j is more skilled than j' if and only if α_j ≥ α_j'; (ii) probabilistic monotonicity (Condition 3) holds for any pair of agents j and j' with equal skill; (iii) the diagnosis rate P_j is independent of α_j in the population of agents.

In Appendix A, we show that Condition 3 implies Condition 2. We also show that, in the limit as the number of agents grows large, Condition 4 implies Condition 2.
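The claim that probabilistic monotonicity implies average monotonicity (in expectation) can be spot-checked numerically. The propensities below are hypothetical, constructed so that agents' diagnosis probabilities are ranked the same way for every case, as Condition 3 requires:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases, n_agents = 500, 20

base = rng.random(n_cases)                       # case-level difficulty
mult = np.sort(rng.random(n_agents))             # agent leniency, sorted
# Pr(d_ij = 1): weakly increasing in j for every case i (Condition 3)
q = np.clip(np.outer(base, mult) * 1.5, 0.0, 1.0)

p = np.full(n_agents, 1 / n_agents)              # caseload shares p_j
P = q.mean(axis=0)                               # agent propensities P_j
P_bar = p @ P                                    # p-weighted average propensity
d_bar = q @ p                                    # case-level average propensity

# Condition 2 in expectation: sum_j p_j (P_j - P_bar)(q_ij - d_bar_i) >= 0,
# one covariance per case i; comonotone rankings force it non-negative
cov = (q - d_bar[:, None]) @ (p * (P - P_bar))
assert (cov >= -1e-12).all()
```

The non-negativity here is an instance of Chebyshev's sum inequality: for each case, the agent propensities and the overall agent propensities are similarly ordered, so their weighted covariance cannot be negative.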
Under any assumption that implies the judges design recovers a well-defined LATE, the coefficient estimand λ from a regression of FN_j on P_j must lie in the interval [−1, 0].8 The implication that λ ∈ [−1, 0] (or, equivalently, that Pr(s_i = 1) ∈ [0, 1] among compliers weighted by their contribution to the LATE) is our proposed test of monotonicity. While this test may fail to detect monotonicity violations, we show in Appendix D that it nevertheless may be stronger than the standard tests of monotonicity in the judges-design literature because it relies on the key (unobserved) state for selection instead of observable characteristics.

The results we show below imply λ ∉ [−1, 0]. They thus imply violation not only of the strict monotonicity of Condition 1(iii) but also of any of the weaker monotonicity Conditions 2, 3, and 4. They not only reject uniform skill but also imply that skill must be systematically correlated with diagnostic propensities. In Section 5, we show why violations of even these weaker monotonicity conditions are natural: when radiologists differ in skill and are aware of these differences, the optimal diagnostic threshold will typically depend on radiologist skill, particularly when the costs of false negatives and false positives are asymmetric. We also show that this relationship between skill and radiologist-chosen diagnostic propensities raises the possibility that common diagnostic thresholds may reduce welfare.

3 Setting and Data

We apply our framework to study pneumonia diagnoses in the emergency department (ED). Pneumonia is a common and potentially deadly disease that is primarily diagnosed by chest X-rays. Reading chest X-rays requires skill, as illustrated in Figure III, which shows example chest X-ray images from the medical literature. We focus on outcomes related to chest X-rays performed in EDs in the Veterans Health Administration (VHA), the largest health care delivery system in the US.
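The proposed test can be illustrated with a radiologist-level OLS regression of miss rates on diagnosis rates. The rates below are hypothetical, and the no-controls OLS here is a simplification of the paper's actual IV implementation.

```python
def lambda_test(P, FN):
    """OLS slope from regressing miss rates FN_j on diagnosis rates P_j
    across agents; under monotonicity the estimand lies in [-1, 0].
    Returns (slope, passes_test)."""
    n = len(P)
    P_bar = sum(P) / n
    FN_bar = sum(FN) / n
    slope = (sum((p - P_bar) * (f - FN_bar) for p, f in zip(P, FN))
             / sum((p - P_bar) ** 2 for p in P))
    return slope, -1.0 <= slope <= 0.0

# Hypothetical radiologists: higher diagnosis rates paired with HIGHER
# miss rates, the pattern the paper documents, which fails the test.
P = [0.08, 0.10, 0.12, 0.14]
FN = [0.004, 0.006, 0.010, 0.014]
```

A positive slope, as in this hypothetical example, falls outside [−1, 0] and rejects monotonicity.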
8 As noted above, any LATE for the effect of d_i on y_i = m_i = 1(d_i = 0, s_i = 1) must lie in the interval [−1, 0]. This implies that the judges-design IV coefficient estimand from a regression of m_i on d_i, instrumenting with radiologist indicators, must lie in this interval. This corresponds to an OLS coefficient estimand from a regression of FN_j on P_j.

In this setting, the diagnostic pathway for pneumonia is as follows:

1. A physician orders a radiology exam for a patient suspected to have the disease.

2. Once the radiology exam is performed, the image is assigned to a radiologist. Exams are typically assigned to radiologists based on whoever is on call at the time the exam needs to be read. We argue below that this assignment is quasi-random conditional on appropriate covariates.

3. The radiologist issues a report on her findings.

4. The patient may be diagnosed and treated by the ordering physician in consultation with the radiologist.

Pneumonia diagnosis is a joint decision by radiologists and physicians. Physician assignment to patients may be non-random, and physicians can affect diagnosis both via their selection of patients to order X-rays for in step 1 and via their diagnostic propensities in step 4. However, so long as the assignment of radiologists in step 2 is as good as random, we can infer the causal effect of radiologists on the probability that the joint decision-making process leads to a diagnosis. While interactions between radiologists and ordering physicians are interesting, we abstract from them in this paper and focus on a radiologist's average effect, taking as given the set of physicians with whom she works.

VHA facilities are divided into local units called "stations." A station typically has a single major tertiary care hospital and a single ED location, together with some medical centers and outpatient clinics. These locations share the same electronic health record and order entry system. We study the 104 VHA stations that have at least one ED.
Our primary sample consists of the roughly 5.5 million completed chest X-rays in these stations that were ordered in the ED and performed between October 1999 and September 2015.9 We refer to these observations as "cases." Each case is associated with a patient and with a radiologist assigned to read it. In the rare cases where a patient received more than one X-ray on a single day, we assign the case to the radiologist associated with the first X-ray observed in the day.

To define our main analysis sample, we first omit the roughly 600,000 cases for which the patient had at least one chest X-ray ordered in the ED in the previous 30 days. We then omit cases with missing radiologist identity, patient age, or patient gender, or with patient age greater than 100 or less than 20. Finally, we omit cases associated with a radiologist-month pair with fewer than 5 observations and cases associated with a radiologist with fewer than 100 observations in total. Appendix Table A.1 reports the number of observations dropped at each of these steps. The final sample contains 4,663,840 cases and 3,199 radiologists.10

We define the diagnosis indicator d_i for case i equal to one if the patient has a pneumonia diagnosis recorded in an outpatient or inpatient visit whose start time falls within a 24-hour window centered at the time stamp of the chest X-ray order.11 We confirm that 92.6 percent of patients who are recorded to have a diagnosis of pneumonia are also prescribed an antibiotic consistent with pneumonia treatment within five days after the chest X-ray. We define a false negative indicator m_i = 1(d_i = 0, s_i = 1) for case i equal to one if d_i = 0 and the patient has a subsequent pneumonia diagnosis recorded between 12 hours and 10 days after the initial chest X-ray. We include diagnoses in both ED and non-ED facilities, including outpatient, inpatient, and surgical encounters.

9 We define chest X-rays by the Current Procedural Terminology (CPT) codes 71010 and 71020.
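The sample restrictions above can be sketched as a sequence of filters. The following is a minimal pure-Python illustration with hypothetical case records; the field names are ours, not the actual VHA variable names.

```python
from collections import Counter

def build_sample(cases):
    """Apply the sample restrictions in the order described above.
    Field names are illustrative, not the actual VHA variable names."""
    kept = [c for c in cases
            if not c["xray_prior_30d"]            # no ED chest X-ray in prior 30 days
            and c["radiologist"] is not None      # radiologist identity observed
            and c["age"] is not None and 20 <= c["age"] <= 100
            and c["gender"] is not None]
    # Drop radiologist-months with fewer than 5 cases ...
    rad_month = Counter((c["radiologist"], c["month"]) for c in kept)
    kept = [c for c in kept if rad_month[(c["radiologist"], c["month"])] >= 5]
    # ... and radiologists with fewer than 100 cases in total.
    rad = Counter(c["radiologist"] for c in kept)
    return [c for c in kept if rad[c["radiologist"]] >= 100]
```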
In practice m_i is measured with error because it requires the patient to return to a VHA facility and for the second visit to correctly identify pneumonia. In Section 5.4, we show robustness of our results to endogenous second diagnoses by restricting analyses to veterans who solely use the VHA and who are sick enough to be admitted on the second visit.

We define the following patient characteristics for each case i: demographics (age, gender, marital status, religion, race, veteran status, and distance from home to the VA facility where the X-ray is ordered), prior health care utilization (counts of outpatient visits, inpatient admissions, and ED visits in any VHA facility in the previous 365 days), prior medical comorbidities (indicators for prior diagnosis of pneumonia and 31 Elixhauser comorbidity indicators in the previous 365 days), vital signs (e.g., blood pressure, pulse, pain score, and temperature), and white blood cell (WBC) count as of the ED encounter. For each case, we also measure characteristics associated with the chest X-ray request. These contain an indicator for whether the request was marked as urgent, an indicator for whether the X-ray involved one or two views, and requesting physician characteristics that we define below. For each variable that contains missing values, we replace missing values with zero and add an indicator for whether the variable is missing. Altogether, this yields 77 variables of patient and order characteristics (hereafter, "patient characteristics" for brevity) in five categories, 11 of which are indicators for missing values. We detail all these variables in Appendix Table A.2.

For each radiologist in the sample, we record gender, date of birth, VHA employment start date,

10 Appendix Figure A.1 presents distributions of cases across radiologists and radiologist-months and of radiologists across stations and station-months.
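The missing-value convention above (zero-fill plus a missingness indicator) can be sketched in a few lines; the use of None for a missing value is our illustrative convention.

```python
def impute_with_indicator(values):
    """Replace missing values (represented here as None) with zero and
    return a parallel 0/1 indicator for missingness, mirroring how each
    variable with missing values is handled."""
    imputed = [0 if v is None else v for v in values]
    missing = [1 if v is None else 0 for v in values]
    return imputed, missing
```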
11 Diagnoses do not have time stamps per se but are instead linked to visits, with time stamps for when the visits begin. Therefore, the time associated with a diagnosis is usually before the chest X-ray order; in a minority of cases, a secondary visit (e.g., an inpatient visit) occurs shortly after the initial ED visit, and we will observe a diagnosis time after the chest X-ray order. We include International Classification of Diseases, Ninth Revision (ICD-9) codes 480-487 for pneumonia diagnosis.

medical school identity, and the proportion of radiology exams that are chest X-rays. For each chest X-ray in the sample, we record the time that a radiologist spent to generate the report in minutes and the length of the report in words. For each requesting physician in the sample, we record the number of X-rays ordered across all patients, above-/below-median indicators for their average patient predicted diagnosis or predicted false negative,12 the physician's leave-out shares of pneumonia diagnoses and false negatives, and the physician's leave-out share of orders marked as urgent.

In the analysis below, we extend our baseline model to address two limitations of our data. First, our sample includes all chest X-rays, not only those that were ordered for suspicion of pneumonia. If an X-ray was ordered for a different reason, such as a rib fracture, it is unlikely even a low-skilled radiologist would incorrectly issue a pneumonia diagnosis. We thus allow for a share κ of cases to have s_i = 0 and to be recognized as such by all radiologists. We calibrate κ using a random-forest algorithm that predicts pneumonia diagnosis based on all characteristics in Appendix Table A.2 and words or phrases extracted from the chest X-ray requisition.
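The calibration of κ reduces to the share of cases whose predicted probability falls below a cutoff; a minimal sketch with hypothetical predicted probabilities:

```python
def calibrate_kappa(pred_probs, cutoff=0.01):
    """kappa = share of cases whose predicted probability of pneumonia
    falls below the cutoff (the paper reports 0.336 at a 0.01 cutoff)."""
    return sum(p < cutoff for p in pred_probs) / len(pred_probs)
```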
We set κ = 0.336, which is the proportion of patients with a random-forest predicted probability of pneumonia less than 0.01.13 Second, some cases we code as false negatives due to a pneumonia diagnosis on the second visit may have either been at too early a stage to be identified even by a highly skilled radiologist, or developed in the interval between the first and second visit. We therefore allow for a share δ of cases that do not have pneumonia detectable by X-ray at the time of their initial visit to develop it and be diagnosed subsequently. We estimate δ as part of our structural analysis below.

4 Model-Free Analysis

4.1 Identification

For each case i, we observe the assigned radiologist j(i), the diagnosis indicator d_i, and the false negative indicator m_i. As the number of cases assigned to each radiologist grows large, these data identify the diagnosis rate P_j and the miss rate FN_j for each j. The data exhibit "one-sided selection," in the sense that the true state is only observed conditional on d_i = 0.14

12 These predictions are fitted values from regressing d_i or m_i on patient demographics.

13 We use an extreme gradient boosting algorithm first introduced in Friedman (2001) and use decision trees as the learner. We train a binary classification model and set the learning rate at 0.15, the maximum depth of a tree at 8, and the number of rounds at 450. We use all variables and all observations in each tree.

14 False negatives are observable by construction in our setting, as we define s_i as cases of pneumonia that will not get better on their own and result in a subsequent observed diagnosis. We conservatively assume that false positives are unobservable, but in practice some cases can present with alternative explanations for a patient's symptoms that would rule out pneumonia.

The first goal of our descriptive analysis is to flexibly identify the shares of the classification matrix in Figure I, Panel A, for each radiologist.
This allows us to plot the actual data in ROC space as in Figure II. The values of P_j and FN_j would be sufficient to identify the remaining elements of the classification matrix if we also knew the share S_j = Pr(s_i = 1 | j(i) = j) of j's patients who had pneumonia, since

TP_j = S_j − FN_j, (1)
FP_j = P_j − TP_j, and (2)
TN_j = 1 − FN_j − TP_j − FP_j. (3)

Identification of the classification matrix therefore reduces to the problem of identifying the values of S_j. Under random assignment of cases to agents, S_j will be equal to the overall population share S = Pr(s_i = 1) for all j. Thus, knowing S would be sufficient for identification. Moreover, the observed data also provide bounds on the possible values of S. If there existed a radiologist j with P_j = 0, we would be able to learn S exactly as S = S_j = FN_j. Otherwise, letting j* denote the radiologist with the lowest diagnosis rate (i.e., j* = argmin_j P_j), we must have S ∈ [FN_j*, FN_j* + P_j*].15 We show in Section 5.2 that S is point identified under the additional functional form assumptions of our structural model. We use an estimate of S = 0.051 from our baseline structural model, and we also consider bounds for S, specifically S ∈ [0.015, 0.073].16

The second goal of our descriptive analysis is to draw inferences about skill heterogeneity and the validity of standard monotonicity assumptions. Even without knowing the value of S, we may be able to reject the hypothesis of uniform skill using just the directly identified objects FN_j and P_j. From Remark 2 we know that skill is not uniform if there exist j and j' such that (FN_j − FN_j')/(P_j − P_j') ∉ [−1, 0]. This will be true in particular if j has both a higher diagnosis rate (P_j > P_j') and a higher miss rate (FN_j > FN_j'). By the discussion in Section 2.3, this rejects the standard monotonicity assumption (Condition 1(iii)) as well as the weaker monotonicity assumptions we consider (Conditions 2 to 4).

With additional assumptions, the data may identify a partial or complete ordering of agent skill.
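Equations (1) to (3), the implied ROC coordinates, and the bounds on S can be computed directly. A minimal sketch with hypothetical rates:

```python
def classification_shares(P_j, FN_j, S):
    """Recover a radiologist's classification matrix from the diagnosis
    rate P_j, miss rate FN_j, and prevalence S via Equations (1)-(3),
    plus the implied ROC coordinates."""
    TP = S - FN_j                # Eq. (1)
    FP = P_j - TP                # Eq. (2)
    TN = 1 - FN_j - TP - FP      # Eq. (3)
    return {"TP": TP, "FP": FP, "TN": TN, "FN": FN_j,
            "TPR": TP / S, "FPR": FP / (1 - S)}

def bounds_on_S(P, FN):
    """S lies in [FN_j*, FN_j* + P_j*], where j* is the radiologist
    with the lowest diagnosis rate."""
    j_star = min(range(len(P)), key=lambda j: P[j])
    return FN[j_star], FN[j_star] + P[j_star]
```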
Suppose, first, that we set aside the possibility that two agents' signals may not be comparable in the Blackwell ordering and so focus on the case where all agents can be ordered by skill. Then for any j and j' with P_j > P_j', (FN_j − FN_j')/(P_j − P_j') < −1 implies that agent j has strictly higher skill than agent j', and (FN_j − FN_j')/(P_j − P_j') > 0 implies that agent j has strictly lower skill than agent j'. The ordering in this case is partial because if (FN_j − FN_j')/(P_j − P_j') ∈ [−1, 0] we cannot determine which agent is more skilled or reject that their skill is the same. If we further assume (as in our structural model below) that agents' signals come from a known family of distributions indexed by skill α, that all agents have P_j ∈ (0, 1), and that the signal distributions satisfy appropriate regularity conditions, the data are sufficient to identify each agent's skill.17

Looking at the data in ROC space provides additional intuition for how skill is identified. While knowing the value of S is not necessary for the arguments in the previous two paragraphs, we suppose for illustration that this value is known, so that the data identify a single point (FPR_j, TPR_j) in ROC space associated with each agent j.18 Agents j and j' have equal skill if (FPR_j, TPR_j) and (FPR_j', TPR_j') lie on a single ROC curve. Since ROC curves must be upward-sloping, we reject uniform skill if there exist j and j' with FPR_j < FPR_j' and TPR_j > TPR_j'. Under the assumption that all agents are ordered by skill, this further implies that j must be strictly more skilled than j'.

15 See Arnold et al. (2020) for a detailed discussion and implementation of identification using these boundary conditions.

16 To construct these bounds, instead of using the radiologist with the lowest diagnosis rate, we divide all radiologists into ten bins based on their diagnosis rates, construct bounds for each bin using the group weighted average diagnosis and miss rates, and take the intersection of all bounds. See Appendix C for more details.
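The partial skill ordering from the slope between two agents' (P_j, FN_j) points can be written as a small decision rule. The rates below are hypothetical, and the rule assumes (as in the text) that all agents can be ordered by skill.

```python
def skill_comparison(P_j, FN_j, P_k, FN_k):
    """Partial skill ordering between agents j and k from the slope
    (FN_j - FN_k)/(P_j - P_k), for P_j != P_k, assuming all agents can
    be ranked by skill. Returns which agent is more skilled, if known."""
    if P_j < P_k:  # the text states the rule for the higher-diagnosing agent
        flipped = skill_comparison(P_k, FN_k, P_j, FN_j)
        return {"j more skilled": "j less skilled",
                "j less skilled": "j more skilled",
                "indeterminate": "indeterminate"}[flipped]
    slope = (FN_j - FN_k) / (P_j - P_k)
    if slope < -1:
        return "j more skilled"
    if slope > 0:
        return "j less skilled"   # higher diagnosis AND higher miss rate
    return "indeterminate"        # slope in [-1, 0]
```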
If signals are drawn from a known family of distributions indexed by α and satisfying appropriate regularity conditions, each value of α corresponds to a distinct non-overlapping ROC curve, and so observing the single point (FPR_j, TPR_j) is sufficient to identify the value of α_j and the slope of the ROC curve at (FPR_j, TPR_j).

Agent preferences are also identified when agents are ordered by skill and signals are drawn from a known family of distributions. If the posterior probability of s_i = 1 is continuously increasing in w_ij for any signal, ROC curves must be smooth and concave (see Appendix B for proof). The implied slope of the ROC curve at (FPR_j, TPR_j) reveals the technological tradeoff between false positives and false negatives at which j is indifferent between d = 0 and d = 1. This tradeoff identifies j's cost of a false negative relative to a false positive, β_j ∈ (0, ∞), which is in turn sufficient to identify the function u_j(·,·) up to normalizations (see Appendix B for proof).

17 For skill to be identified, the signal distributions need to satisfy regularity conditions guaranteeing that the miss rate FN_j achievable for any given diagnosis rate P_j is strictly decreasing in skill. Then there is a unique mapping from (FN_j, P_j) to skill.

18 Richer data could identify more points on a single agent's ROC curve, for example by exploiting variation in preferences (e.g., the cost of diagnosis) for the same agent while holding skill fixed.

4.2 Quasi-Random Assignment

A key assumption of our empirical analysis is quasi-random assignment of patients to radiologists. Our qualitative research suggests that the typical pattern is for patients to be assigned sequentially to available radiologists at the time their physician orders a chest X-ray. Such assignment will be plausibly quasi-random provided we control for the time and location factors that determine which radiologists are working at the time of each patient's visit (e.g., Chan 2018).
Assumption 1 (Conditional Independence). Conditional on the hour of day, day of week, month, and location of patient i's visit, the state s_i and potential diagnosis decisions {d_ij}_{j∈J} are independent of the assigned radiologist j(i).

In practice, we will implement this conditioning by controlling for a vector T_i containing hour-of-day, day-of-week, and month-year indicators, each interacted with indicators for the station that i visits. Our results thus require both that Assumption 1 holds and that this additively-separable functional form for the controls is sufficient. We refer to T_i as our minimal controls.

While we expect assignment to be approximately random in all stations, organization and procedures differ across stations in ways that mean our time controls may do a better job of capturing confounding variation in some stations than in others.19 We will therefore present our main model-free analyses for two sets of stations: the full set of 104 stations, and a subset of 44 of these stations for which we detect no statistically significant imbalance across radiologists in a single characteristic, patient age. Specifically, these 44 stations are all those for which the F-test for joint significance of radiologist dummies in a regression of patient age on those dummies and minimal controls, clustered by radiologist-day, fails to reject at the 10 percent level.

To provide evidence on the plausibility of quasi-random assignment, we look at the extent to which our vector of observable patient characteristics is balanced across radiologists conditional on the minimal controls. Paralleling the main regression analysis below, we first define a leave-out measure of the diagnosis propensity of each patient's assigned radiologist,

Z_i = (1 / (|I_{j(i)}| − 1)) Σ_{i' ∈ I_{j(i)}, i' ≠ i} d_{i'}, (4)

19 In our qualitative research, we identify at least two types of conditioning sets that are unobserved to us.
One is that the population of radiologists in some stations includes both "regular" radiologists who are assigned chest X-rays according to the normal sequential protocol and other radiologists who only read chest X-rays when the regular radiologists are not available or in other special circumstances. A second is that some stations consist of multiple sub-locations, and both patients and radiologists could sort systematically to sub-locations. Since our fixed effects do not capture either radiologist "types" or sub-locations, either of these could lead Assumption 1 to be violated.

where I_j is the set of patients assigned to radiologist j. We then ask whether Z_i is predictable from our main vector X_i of patient i's 77 observables after conditioning on the minimal controls.

Figure IV presents the results. Panels A and B present individual coefficients from regressions of d_i (a patient's own diagnosis status) and Z_i (the leave-out propensity of the assigned radiologist), respectively, on the elements of X_i, controlling for T_i. Continuous elements of X_i are standardized. At the bottom of each panel we report F-statistics and p-values for the null hypothesis that all coefficients on the elements of X_i are equal to zero. Although X_i is highly predictive of a patient's own diagnosis status, it has far less predictive power for Z_i, with an F-statistic two orders of magnitude smaller and most coefficients close to zero. The small number of variables that are predictive of Z_i (most notably characteristics of the requesting physician) are largely not predictive of d_i, and there is no obvious relationship between their respective coefficients in the regressions of d_i and Z_i. Panel C presents the analogue of Panel B for the subset of 44 stations with balance on age.20 Here the F-statistic falls further, and the physician ordering characteristics that stand out in the middle panel are no longer individually significant.
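The leave-out propensity in Equation (4) is a leave-one-out mean over the assigned radiologist's other cases. A minimal sketch with hypothetical assignments:

```python
from collections import defaultdict

def leave_out_propensity(d, radiologist):
    """Z_i from Equation (4): the mean diagnosis rate over the other
    patients of patient i's assigned radiologist. d[i] is the diagnosis
    indicator and radiologist[i] the assignment."""
    count, total = defaultdict(int), defaultdict(int)
    for di, j in zip(d, radiologist):
        count[j] += 1
        total[j] += di
    return [(total[j] - di) / (count[j] - 1)
            for di, j in zip(d, radiologist)]
```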
Thus, these stations, which were selected for balance only on age, also display balance on the other elements of X_i.

We present additional evidence of balance below and in the appendix. As an input to this analysis, we form predicted values d̂_i of the diagnosis indicator d_i, and m̂_i of the false negative indicator m_i, based on respective regressions of d_i and m_i on X_i alone. This provides a low-dimensional projection of X_i that isolates the most relevant variation.

In Section 4.3, we provide graphical evidence on the magnitude of the relationship between predicted miss rates m̂_i and radiologist diagnostic propensities Z_i, paralleling our main analysis, which focuses on the relationship between m_i and Z_i. This confirms that the relationship with m̂_i is economically small. We also show in Section 4.3 that our key reduced-form regression coefficient is similar whether we control for none, all, or some of the variables in X_i.

In Appendix Figure A.2, we show similar results to those in Figure IV using radiologists' (leave-out) miss rates in place of the diagnosis propensities Z_i. In Appendix Table A.3, we report F-statistics and p-values analogous to those in Figure IV and Appendix Figure A.2 for subsets of the characteristic vector X_i, showing that the main pattern remains consistent across these subsets.

In Appendix Table A.4, we compare values of d̂_i and m̂_i across radiologists with high and low diagnosis and miss rates, a lower-dimensional analogue of the tests in Figure IV and Appendix Figure A.2. The results confirm the main conclusions we draw from Figure IV, showing small differences in the full sample of stations and negligible differences in the 44-station subsample.

20 For brevity, we omit the analogue of Panel A for these 44 stations. This is presented in Appendix Figure A.3, and it confirms that the relationship between d_i and X_i remains qualitatively similar.
In Appendix Figure A.4, we present results from a permutation test in which we randomly reassign d_i and m_i across patients within each station after partialing out minimal controls, estimate radiologist fixed effects from regressions of the reshuffled d_i and m_i on radiologist dummies, and then compute the patient-weighted standard deviation of the estimated radiologist fixed effects within each station. Comparing these to the analogous standard deviation based on the real data provides a permutation-based p-value for balance in each station. We find that these p-values are roughly uniformly distributed in the 44 stations selected for balance on age, confirming that these stations exhibit balance on characteristics other than age. In Appendix Figure A.5, we present a complementary simulation exercise that suggests that we have the power to reject more than a few percent of patients in these stations being systematically sorted to radiologists.

4.3 Main Results

The first goal of our descriptive analysis is to flexibly identify the shares of the classification matrix in Figure I, Panel A, for each radiologist. This allows us to plot the data in ROC space as in Figure II. We first form estimates P_j^obs and FN_j^obs of each radiologist's risk-adjusted diagnosis and miss rates.21 We then further adjust these for the parameters κ and δ introduced in Section 3 to arrive at estimates of the underlying P_j and FN_j. We fix the share κ of cases not at risk of pneumonia at the estimated value 0.336 discussed in Section 3, and we fix the share δ of cases whose pneumonia manifests after the first visit at the value 0.026 estimated in the structural analysis.

There is substantial variation in P_j and FN_j. Reassigning patients from a radiologist in the 10th percentile of diagnosis rates to a radiologist in the 90th percentile would increase the probability of a diagnosis from 8.9 percent to 12.3 percent.
Reassigning patients from a radiologist in the 10th percentile of miss rates to a radiologist in the 90th percentile would increase the probability of a false negative from 0.2 percent to 1.8 percent. Appendix Table A.5 shows these and other moments of radiologist-level estimates.

Finally, we solve for the remaining shares of the classification matrix by Equations (1) to (3) and the prevalence rate S = 0.051, which we estimate in the structural analysis. We truncate the estimated values FPR_j and TPR_j so that they lie in [0, 1] and so that TPR_j ≥ FPR_j.22 Appendix C provides further detail on these calculations. We present estimates of (FPR_j, TPR_j) in ROC space in Figure V. They show clearly that the data are inconsistent with the assumption of all radiologists lying along a single ROC curve, and instead suggest substantial heterogeneity in skill.23

The second goal of our descriptive analysis is to estimate the relationship between radiologists' diagnosis rates P_j and their miss rates FN_j. We focus on the coefficient estimand λ from a linear regression of FN_j on P_j in the population of radiologists. As discussed in Section 2.3, λ ∈ [−1, 0] is an implication of both the standard monotonicity of Condition 1(iii) and the weaker versions of monotonicity we consider as well. Under our maintained assumptions, λ ∉ [−1, 0] implies that radiologists must not have uniform skill and that skill must be systematically correlated with diagnostic propensities.

21 We form these as the fitted radiologist fixed effects from respective regressions of d_i and m_i on radiologist fixed effects, patient characteristics X_i, and minimal controls T_i. We recenter P_j^obs and FN_j^obs within each station so that the patient-weighted averages within each station are equal to the overall population rate, and truncate these adjusted rates below at zero. This truncation applies to 2 out of 3,199 radiologists in the case of P_j^obs and 45 out of 3,199 radiologists in the case of FN_j^obs.
Exploiting quasi-experimental variation under Assumption 1, we can recover a consistent estimate of λ from a 2SLS regression of m_i on d_i, instrumenting for the latter with the leave-out propensity Z_i.24 In these regressions, we control for the vector of patient observables X_i as well as the minimal time and station controls T_i. Using the leave-out propensity is a standard approach that prevents overfitting the first stage in finite samples, which would otherwise bias the coefficient toward an OLS estimate of the relationship between m_i and d_i (Angrist et al. 1999). We show in Appendix Figure A.7 that results are qualitatively similar if we use radiologist dummies as instruments.

Figure VI presents the results. To visualize the IV relationship, we estimate the first-stage regression of d_i on Z_i, controlling for X_i and T_i.25 We then plot a binned scatter of m_i against the fitted values from the first stage, residualizing both with respect to X_i and T_i, and recentering both to their respective sample means. The figure also shows the IV coefficient and standard error. In both the overall sample (Panel A) and in the sample selected for balance on age (Panel B), we show a strongly positive relationship between diagnosis predicted by the instrument and false negatives, controlling for the full set of patient characteristics. This upward slope implies that

22 Imposing TPR_j ≤ 1 affects 597 observations (18.7% of the total). Imposing FPR_j ≥ 0 affects 44 observations. Imposing TPR_j ≥ FPR_j affects 68 observations.

23 In Appendix Figure A.6, we show how the results change when we set S at the lower bound (S = 0.015) and upper bound (S = 0.073) derived in Section 4.1. The values of TPR and FPR change substantially, but the overall pattern of a negative slope in ROC space remains robust.
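With a single instrument and no controls, the 2SLS coefficient reduces to the Wald ratio cov(Z, m)/cov(Z, d). The sketch below uses hypothetical data; the paper's actual estimates additionally control for X_i and T_i.

```python
def iv_slope(m, d, z):
    """2SLS/Wald slope of outcome m on treatment d using instrument z,
    with no controls: cov(z, m) / cov(z, d)."""
    n = len(z)
    zb, mb, db = sum(z) / n, sum(m) / n, sum(d) / n
    cov_zm = sum((zi - zb) * (mi - mb) for zi, mi in zip(z, m))
    cov_zd = sum((zi - zb) * (di - db) for zi, di in zip(z, d))
    return cov_zm / cov_zd
```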
As discussed in Section 4.1, the sign of the slope of the line connecting any two points in ROC space is in fact identified independently of the value of S, so this robustness is, in a sense, guaranteed. In the same figure, we show that varying the assumed values of δ and κ similarly affects the levels but not the qualitative pattern in ROC space.

24 Observed m_i and d_i do not account for the parameters κ and δ, so we are estimating a coefficient λ^obs from a regression of FN_j^obs on P_j^obs. In Appendix C, we show that λ ∈ [−1, 0] is equivalent to λ^obs ∈ [−1, −δ], which is an even smaller admissible range.

25 We show the first-stage relationship in Appendix Figure A.8.

the miss rate is higher for high-diagnosing radiologists not only conditionally (in the sense that the patients they do not diagnose are more likely to have pneumonia) but unconditionally as well. Thus, being assigned to a radiologist who diagnoses patients more aggressively increases the likelihood of leaving the hospital with undiagnosed pneumonia. Under Assumption 1, this implies violations of monotonicity. The only explanation for this under our framework is that high-diagnosing radiologists have less accurate signals, and that this is true to a large enough degree to offset the mechanical negative relationship between diagnosis and false negatives.

In Figure VII, we provide additional evidence on whether imbalances in patient characteristics may explain this relationship. This figure is analogous to Figure VI, with the predicted false negative m̂_i in place of the actual false negative m_i, and with controls X_i omitted. In the overall sample (Panel A), radiologists with higher diagnosis rates are assigned patients with characteristics that predict more false negatives. However, this relationship is small in magnitude in the full sample and negligible in the subsample comprising the 44 stations with balance on age (Panel B).
Notably, the positive IV coefficient in Figure VI is even larger in the latter subsample of stations.

In Appendix Figure A.9, we show a scatterplot that collapses the underlying data points from Figure VI to the radiologist level. This plot reveals substantial heterogeneity in miss rates among radiologists with similar diagnosis rates: For the same diagnosis rate, a radiologist at the case-weighted 90th percentile of miss rates has a miss rate 0.7 percentage points higher than that of a radiologist at the case-weighted 10th percentile. This provides further evidence against the standard monotonicity assumption, which implies that all radiologists with a given diagnosis rate must also have the same miss rate.26

In Appendix D, we show that our data pass informal tests of monotonicity that are standard in the literature (Dobbie et al. 2018; Bhuller et al. 2020), as shown in Appendix Table A.6. These tests require that diagnosis consistently increases in P_j in a range of patient subgroups.27 Thus, together with the evidence of quasi-random assignment in Section 4.2, the standard empirical framework would suggest this as a plausible setting in which to use radiologist assignment as an instrument for the treatment variable d_ij.

However, were we to apply the standard approach and use radiologist assignment as an instrument to estimate an average effect of diagnosis d_ij on false negatives, we would reach the nonsensical conclusion that diagnosing a patient with pneumonia (and thus giving them antibiotics) makes them more likely to return with untreated pneumonia in the following days.28 Standard tests of monotonicity may pass while our test may strongly reject monotonicity (by Λ ∉ [−1, 0]) when monotonicity violations systematically occur along an underlying state s_i but not along observable characteristics. In Appendix D, we formally show that our test would be equivalent to a standard test if s_i were observable and used as a "characteristic" to form subgroups within which to confirm a positive first stage.

26 In Appendix Figure A.10, we investigate the IV-implied relationship between diagnosis and false negatives within each station and show that, in the vast majority of stations, the station-specific estimate of Λ is outside of the bounds of [−1, 0].

27 In Appendix D, we also show the relationship between these standard tests and our test. We discuss that these results suggest that (i) radiologists consider unobserved patient characteristics in their diagnostic decisions; (ii) these unobserved characteristics predict s_i; and (iii) their use distinguishes high-skilled radiologists from low-skilled radiologists.

4.4 Robustness

Given the small but significant imbalance that we detect in Section 4.2, we examine the robustness of our results to varying controls for patient characteristics as well as to the set of stations we consider. We first divide our 77 patient characteristics into 10 groups.30 Next, we run separate regressions using each of the 2^10 = 1,024 possible combinations of these 10 groups as controls. Figure VIII shows the range of the coefficients from IV regressions analogous to Figure VI across these specifications. The number of specifications that corresponds to a given number of patient controls differs: Controlling for either no patient characteristics or all patient characteristics results in one specification, while controlling for n of the 10 groups results in "10 choose n" specifications. For each number of characteristics on the x-axis, we plot the minimum, maximum, and mean IV estimate of Λ. The mean estimate actually increases with more controls, and no specification yields an estimate that is close to 0.
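The specification count follows directly from enumerating subsets of the 10 control groups. A short sketch (the six demographic groups are those named in footnote 30; the remaining four labels are generic placeholders for the other characteristic categories):

```python
from itertools import combinations
from math import comb

# 10 control groups: six demographic groups plus four other characteristic
# categories (the last four labels here are placeholders)
groups = ["age_gender", "marital_status", "race", "religion", "veteran_status",
          "home_station_distance", "category_2", "category_3", "category_4", "category_5"]

# every subset of the 10 groups defines one control specification
specs = [c for n in range(len(groups) + 1) for c in combinations(groups, n)]

n_specs = len(specs)                             # 2^10 = 1,024 specifications
per_size = {n: comb(10, n) for n in range(11)}   # "10 choose n" per number of controls
```

For example, there is one specification with no controls, 252 with exactly five groups, and one with all ten.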
Panel A displays results using observations from all stations, and Panel B displays results using observations only from the 44 stations in which we find balance on age. As expected, the slope estimates are even more robust in Panel B.

5 Structural Analysis

In this section, we specify and estimate a structural model with variation in both skill and preferences. It builds on the canonical selection framework by allowing radiologists to observe different signals of patients' true conditions, and so to rank cases differently by their appropriateness for diagnosis.

28 As shown in Appendix Table A.7, in our sample of all stations, we also find that diagnosing and treating pneumonia implausibly increases mortality, repeat ED visits, patient-days in the hospital, and ICU admissions. However, in the sample of 44 stations with balance on age, these effects are statistically insignificant, reversed in sign, and smaller in magnitude.

29 We note in Section 2.3 a close connection between our test and tests of IV validity proposed by Kitagawa (2015) and Mourifié and Wan (2017). Our test maps more directly to monotonicity because we use an "outcome" m_i = 1(d_i = 0, s_i = 1) that is mechanically defined by d_i and s_i, so that "exclusion" in Condition 1(i) is satisfied by construction.

30 We divide all patient characteristics into five categories in Appendix Table A.2. We further divide the first category (demographics) into six groups: age and gender, marital status, race, religion, an indicator for veteran status, and the distance between home and the VA station performing the X-ray. Combining these six groups with the other four categories gives us 10 groups.

5.1 Model

Patient i's true state s_i is determined by a latent index ν_i ~ N(0, 1). If ν_i is greater than ν̄, then the patient has pneumonia:

s_i = 1(ν_i > ν̄).

The radiologist j assigned to patient i observes a noisy signal w_ij ~ N(0, 1) correlated with ν_i.
The strength of the correlation between w_ij and ν_i characterizes the radiologist's skill α_j ∈ (0, 1]:31

(ν_i, w_ij)′ ~ N( (0, 0)′, [1, α_j; α_j, 1] ).    (5)

We assume that radiologists know both the cutoff value ν̄ and their own skill α_j. Note that normalizing the means and variances of ν_i and w_ij to zero and one, respectively, is without loss of generality.

The radiologist's utility is given by

u_ij = −1,      if d_ij = 1, s_i = 0,
u_ij = −β_j,    if d_ij = 0, s_i = 1,    (6)
u_ij = 0,       otherwise.

The key preference parameter β_j captures the disutility of a false negative relative to a false positive. Given that the health cost of undiagnosed pneumonia is potentially much greater than the cost of inadvertently giving antibiotics to a patient who does not need them, we expect β_j > 1. We normalize the utility of correctly classifying patients to zero. Note that this parameterization of u_ij(d, s) with a single parameter β_j is without loss of generality, in the sense that the ratio β_j = [u_ij(1,1) − u_ij(0,1)] / [u_ij(0,0) − u_ij(1,0)] is sufficient to determine the agent's optimal decision given the posterior Pr(s_i = 1 | w_ij, α_j), as discussed in Section 4.1.

In Appendix E.1, we show that the radiologist's optimal decision rule reduces to a cutoff value τ_j such that d_ij = 1(w_ij > τ_j). The optimal cutoff τ* must be such that the agent's posterior probability that s_i = 0 after observing w_ij = τ* is equal to β_j / (1 + β_j).

31 The joint-normal distribution of ν_i and w_ij determines the set of potential shapes of radiologist ROC curves. This simple parameterization implies concave ROC curves above the 45-degree line, attractive features described in Section 2.2. In Appendix Figure A.11, we map the correlation α_j to the Area Under the Curve (AUC), which is a common measure of performance in classification. The AUC measures the area under the ROC curve: An AUC value of 0.5 corresponds to classification no better than random chance (i.e., α_j = 0), whereas an AUC value of 1 corresponds to perfect classification (i.e., α_j = 1).
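The signal structure in Equation (5) is straightforward to simulate. A minimal sketch, with illustrative values of ν̄, α_j, and a threshold (not the estimated ones), showing how a single correlation parameter generates a point in ROC space and an AUC above one half:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
nu_bar, alpha, tau = 1.2, 0.85, 1.5   # illustrative cutoff, skill, and threshold

# Equation (5): (nu, w) joint standard normal with correlation alpha
nu = rng.standard_normal(n)
w = alpha * nu + np.sqrt(1 - alpha**2) * rng.standard_normal(n)

s = nu > nu_bar          # true state
d = w > tau              # diagnosis decision

tpr = (d & s).sum() / s.sum()       # true positive rate
fpr = (d & ~s).sum() / (~s).sum()   # false positive rate

# AUC: chance a random pneumonia case's signal exceeds a random healthy case's
auc = (rng.choice(w[s], 100_000) > rng.choice(w[~s], 100_000)).mean()
```

Sweeping τ while holding α fixed traces out one concave ROC curve; higher α shifts the whole curve toward the upper-left corner.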
The formula for the optimal threshold is

τ*(α_j, β_j) = [ν̄ − √(1 − α_j²) Φ⁻¹(β_j / (1 + β_j))] / α_j.    (7)

The cutoff value in turn implies FP_j and FN_j, which give expected utility

E[u_j] = −(FP_j + β_j FN_j).    (8)

The comparative statics of the threshold τ* with respect to ν̄ and β_j are intuitive. The higher is ν̄, and thus the smaller the share S of patients who in fact have pneumonia, the higher is the threshold. The higher is β_j, and thus the greater the cost of a missed diagnosis relative to a false positive, the lower is the threshold.

The effect of skill α_j on the threshold can be ambiguous. This arises because α_j has two distinct effects on the radiologist's posterior on ν_i: (i) it shifts the posterior mean further from zero and closer to the observed signal w_ij; and (ii) it reduces the posterior variance. For α_j ≈ 0, the radiologist's posterior is close to the prior N(0, 1) regardless of the signal. If pneumonia is uncommon, in particular if ν̄ > Φ⁻¹(β_j / (1 + β_j)), she will prefer not to diagnose any patients, implying τ* ≈ ∞. As α_j increases, effect (i) dominates. This makes any given w_ij more informative and so causes the optimal threshold to fall. As α_j increases further, effect (ii) dominates. This makes the agent less concerned about the risk of false negatives and so causes the optimal threshold to rise. Given Equation (7), we should expect thresholds to be correlated with skill when costs are highly asymmetric (i.e., β_j is far from 1) or, for low skill, when the condition is rare (i.e., ν̄ is high). Figure IX shows the relationship between α_j and τ_j for different values of β_j. Appendix E.1 discusses comparative statics of τ* further.

In Appendix G.1, we show that a richer model allowing pneumonia severity to affect both the probability of diagnosis and the disutility of a false negative yields a similar threshold-crossing model with equivalent empirical implications.
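Equation (7) and its comparative statics can be checked numerically. A sketch in which ν̄ is an illustrative value and β = 6.71 borrows the paper's mean preference estimate:

```python
import numpy as np
from scipy.stats import norm

def tau_star(alpha, beta, nu_bar):
    """Equation (7): optimal threshold given skill alpha, preference beta, cutoff nu_bar."""
    return (nu_bar - np.sqrt(1 - alpha**2) * norm.ppf(beta / (1 + beta))) / alpha

nu_bar, beta = 1.5, 6.71   # illustrative cutoff; beta at the mean preference estimate

# threshold is non-monotone in skill: it falls, then rises
alphas = np.array([0.05, 0.3, 0.6, 0.9, 0.99])
taus = tau_star(alphas, beta, nu_bar)
```

Because ν̄ > Φ⁻¹(β/(1+β)) at these values, the threshold diverges as α → 0, falls as effect (i) dominates, and rises again as effect (ii) dominates; a higher β or a lower ν̄ shifts the whole schedule down.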
In Appendix G.2, we also explore an alternative formulation in which τ_j depends on a potentially misinformed belief about α_j and a β_j fixed at some social welfare weight β°. From a social planner's perspective, for a given skill α_j, deviations from τ*(α_j, β°) yield equivalent welfare losses regardless of whether they arise from deviations of β_j from β° or from deviations of beliefs about α_j from the truth.

If we know a radiologist's FPR_j and TPR_j in ROC space, then we can identify her skill α_j by the shape of potential ROC curves, as discussed in Section 4.1, and her preference β_j by her diagnosis rate and Equation (7). Equation (5) determines the shape of potential ROC curves and implies that they are smooth and concave, consistent with utility maximization. It also guarantees that two ROC curves never intersect and that each (FPR_j, TPR_j) point lies on only one ROC curve.

The parameters κ and λ can be identified by the joint-normal signal structure implied by Equation (5). With λ = 0, a radiologist with FPR_j ≈ 0 must have a nearly perfectly informative signal and so should also have TPR_j ≈ 1. We in fact observe that some radiologists with no false positives still have some false negatives, and the value of λ is determined by the size of this gap. Similarly, with κ = 0, a radiologist with TPR_j ≈ 1 should either have perfect skill (implying FPR_j ≈ 0) or simply diagnose everyone (implying FPR_j ≈ 1). So the value of κ is identified if we observe a radiologist j with TPR_j ≈ 1 and with FPR_j far from 0 and 1, as the fraction of cases that j does not diagnose. In our estimation described below, we do not estimate κ but rather calibrate it from separate data as described in Section 3.32

5.2 Estimation

We estimate the model using observed data on diagnoses d_i and false negatives m_i. Recall that we observe m_i = 0 for any i such that d_i = 1, and m_i = 1 is only possible if d_i = 0.
We define the following probabilities, conditional on γ_j = (α_j, β_j):

p_1j(γ_j) = Pr(w_ij > τ_j | γ_j),
p_2j(γ_j) = Pr(w_ij ≤ τ_j, ν_i > ν̄ | γ_j),
p_3j(γ_j) = Pr(w_ij ≤ τ_j, ν_i ≤ ν̄ | γ_j).

The likelihood of observing (d_i, m_i) for a case i assigned to radiologist j(i) is

L_i(d_i, m_i | γ_j(i)) = (1 − κ) p_1j(γ_j(i)),                    if d_i = 1,
L_i(d_i, m_i | γ_j(i)) = (1 − κ)(p_2j(γ_j(i)) + λ p_3j(γ_j(i))),  if d_i = 0, m_i = 1,
L_i(d_i, m_i | γ_j(i)) = (1 − κ)(1 − λ) p_3j(γ_j(i)) + κ,         if d_i = 0, m_i = 0.

32 While κ is in principle identified, radiologists with the highest TPR_j have FPR_j ≈ 0 and do not have the highest diagnosis rate. These radiologists appear to have close to perfect skill, which is consistent with any κ. Thus, we cannot identify κ in practice. In Appendix Table A.10, we show that our results and their policy implications do not depend qualitatively on our choice of κ.

For the set of patients assigned to j, I_j = {i : j(i) = j}, the likelihood of d_j = {d_i}_{i∈I_j} and m_j = {m_i}_{i∈I_j} is

L_j(d_j, m_j | γ_j) = ∏_{i∈I_j} L_i(d_i, m_i | γ_j)
= [(1 − κ) p_1j(γ_j)]^{n_j^d} · [(1 − κ)(p_2j(γ_j) + λ p_3j(γ_j))]^{n_j^m} · [(1 − κ)(1 − λ) p_3j(γ_j) + κ]^{n_j − n_j^d − n_j^m},

where n_j^d = Σ_{i∈I_j} d_i, n_j^m = Σ_{i∈I_j} m_i, and n_j = |I_j|. From the above expression, n_j^d, n_j^m, and n_j are sufficient statistics of the likelihood of d_j and m_j, and we can write the radiologist likelihood as L_j(n_j^d, n_j^m, n_j | γ_j).

Given the finite number of cases per radiologist, we additionally make an assumption on the population distribution of α_j and β_j across radiologists to improve power. Specifically, we assume

(ã_j, b̃_j)′ ~ N( (μ_α, μ_β)′, [σ_α², ρ σ_α σ_β; ρ σ_α σ_β, σ_β²] ),

where α_j = ½(1 + tanh ã_j) ∈ (0, 1) and β_j = exp(b̃_j) > 0. We set ρ = 0 in our baseline specification but allow its estimation in Appendix F.

Finally, to allow for potential deviations from random assignment, we fit the model to counts of diagnoses and false negatives that are risk-adjusted to account for differences in patient characteristics X_i and minimal controls T_i. We begin with the risk-adjusted radiologist diagnosis and miss rates P_j^obs and FN_j^obs defined in Section 4.3.
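The case probabilities and the radiologist likelihood above can be computed directly, using the bivariate normal CDF for p_3j. A self-contained sketch with hypothetical counts and parameter values (none of these numbers are the paper's estimates):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def case_probs(alpha, tau, nu_bar):
    """p1 = Pr(w > tau), p2 = Pr(w <= tau, nu > nu_bar), p3 = Pr(w <= tau, nu <= nu_bar)."""
    p3 = multivariate_normal([0.0, 0.0], [[1.0, alpha], [alpha, 1.0]]).cdf([tau, nu_bar])
    p1 = 1.0 - norm.cdf(tau)
    p2 = 1.0 - p1 - p3
    return p1, p2, p3

def log_lik(n_d, n_m, n, alpha, tau, nu_bar, kappa, lam):
    """Radiologist log-likelihood in terms of sufficient statistics (n_d, n_m, n)."""
    p1, p2, p3 = case_probs(alpha, tau, nu_bar)
    l_d = (1 - kappa) * p1                       # d = 1
    l_m = (1 - kappa) * (p2 + lam * p3)          # d = 0, m = 1
    l_0 = (1 - kappa) * (1 - lam) * p3 + kappa   # d = 0, m = 0
    return n_d * np.log(l_d) + n_m * np.log(l_m) + (n - n_d - n_m) * np.log(l_0)

ll = log_lik(n_d=90, n_m=12, n=1000, alpha=0.85, tau=1.1, nu_bar=1.5, kappa=0.03, lam=0.1)
```

Note that the three case probabilities sum to one, so the likelihood is a well-defined trinomial in the sufficient statistics.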
We then impute diagnosis and false negative counts n_j^d = n_j P_j^obs and n_j^m = n_j FN_j^obs, where n_j is the number of patients assigned to radiologist j; the imputed counts are not necessarily integers. In a second step, we maximize the following log-likelihood to estimate the hyperparameter vector θ = (μ_α, μ_β, σ_α, σ_β, λ, ν̄):

θ̂ = argmax_θ Σ_j log ∫ L_j(n_j^d, n_j^m, n_j | γ_j) f(γ_j | θ) dγ_j.

We compute the integral by simulation, described in further detail in Appendix E.2. Given our estimate of θ and each radiologist's risk-adjusted data (n_j^d, n_j^m, n_j), we can also form an empirical Bayes posterior mean of each radiologist's skill and preference (α_j, β_j), which we describe in Appendix E.3.

Our risk-adjustment approach can be seen as fitting the model to an "average" population of patients and radiologists whose distribution of diagnosis and miss rates is the same as the risk-adjusted values we characterize in our reduced-form analysis. An alternative would be to incorporate heterogeneity by station, time, and patient characteristics explicitly in the structural model, for example by allowing these to shift the distribution of patient health. While this would be more coherent from a structural point of view, doing so with sufficient flexibility to guarantee quasi-random assignment would be computationally challenging. We show in Section 5.4 below that our main results are qualitatively similar if we exclude X_i from risk adjustment or even omit the risk-adjustment step altogether. We show evidence from Monte Carlo simulations in Appendix G.3 that our linear risk adjustment is highly effective in addressing bias due to variation in risk across groups of observations, even when it is misspecified as additively separable.

5.3 Results

Panel A of Table I shows estimates of the hyperparameter vector θ in our baseline specification. Panel B of Table I shows moments of the distribution of posterior means of (α_j, β_j) implied by the model parameters.
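The simulated-likelihood step and the empirical Bayes posterior can be sketched together. Every number here is hypothetical (hyperparameters, κ, λ, ν̄, and the counts); the point is the mechanics: average the radiologist likelihood over hyperprior draws to approximate the integral in the objective, then reuse the same draws, weighted by likelihood, for the posterior mean of skill:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(2)
nu_bar, kappa, lam = 1.5, 0.03, 0.10        # illustrative values, not the estimates

def tau_star(a, b):
    return (nu_bar - np.sqrt(1 - a**2) * norm.ppf(b / (1 + b))) / a

def lik(n_d, n_m, n, a, b):
    tau = tau_star(a, b)
    p3 = multivariate_normal([0.0, 0.0], [[1.0, a], [a, 1.0]]).cdf([tau, nu_bar])
    p1 = 1.0 - norm.cdf(tau)
    p2 = 1.0 - p1 - p3
    l_d = (1 - kappa) * p1
    l_m = (1 - kappa) * (p2 + lam * p3)
    l_0 = (1 - kappa) * (1 - lam) * p3 + kappa
    return l_d**n_d * l_m**n_m * l_0**(n - n_d - n_m)

# Monte Carlo draws from the hyperprior: alpha = (1 + tanh(a~))/2, beta = exp(b~)
mu_a, sig_a, mu_b, sig_b = 1.0, 0.4, 1.9, 0.2    # hypothetical hyperparameters
a_draws = 0.5 * (1 + np.tanh(rng.normal(mu_a, sig_a, 1_000)))
b_draws = np.exp(rng.normal(mu_b, sig_b, 1_000))

n_d, n_m, n = 90, 12, 1000                       # one radiologist's imputed counts
w = np.array([lik(n_d, n_m, n, a, b) for a, b in zip(a_draws, b_draws)])

marginal = w.mean()                              # simulated integral in the objective
alpha_eb = (w * a_draws).sum() / w.sum()         # empirical Bayes posterior mean of skill
```

In practice, working in logs with a log-sum-exp would be safer numerically for radiologists with many cases; the plain-probability version is kept here for transparency.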
In the baseline specification, the mean radiologist skill is relatively high, at 0.85. This implies that the average radiologist receives a signal that has a correlation of 0.85 with the patient's underlying latent state ν_i. This correlation is 0.76 for a radiologist at the 10th percentile of the skill distribution and 0.93 for a radiologist at the 90th percentile. The average radiologist preference weights a false negative 6.71 times as heavily as a false positive. This relative weight is 5.60 at the 10th percentile of the preference distribution and 7.91 at the 90th percentile. In Appendix Figure A.12, we compare the distributions of observed data moments of radiologist diagnosis and miss rates with those simulated from the model at the estimated parameter values.33 In all cases, the simulated data match the observed data closely.

In Figure IX, we display empirical Bayes posterior means for (α_j, β_j) in a space that represents optimal diagnostic thresholds. The relationship between skill and diagnostic thresholds is mostly positive. As radiologists become more accurate, they diagnose fewer people (their thresholds increase), since the costly possibility of making a false negative diagnosis decreases.

33 We construct simulated moments as follows. We first fix the number of patients each radiologist examines to the actual number. We then simulate patients at risk from a binomial distribution with a probability of being at risk of 1 − κ. For patients at risk, we simulate the underlying true signal and the radiologist-observed signal, ν_i and w_ij, respectively, using our posterior mean for α_j. We determine which patients are diagnosed with pneumonia and which patients are false negatives based on τ*(α_j, β_j), the simulated signals, and ν̄. We finally simulate patients who did not initially have pneumonia but later develop it with probability λ.
In Appendix Figure A.13, we show the distributions of the empirical Bayes posterior means for α_j, β_j, and τ_j, and the joint distribution of α_j and β_j. Finally, in Appendix Figure A.14, we transform the empirical Bayes posterior means for (α_j, β_j) into moments in ROC space. The relationship between TPR_j and FPR_j implied by the empirical Bayes posterior means is similar to that implied by the flexible projection shown earlier in Figure V.

5.4 Robustness

In Appendix F, we explore alternative samples, controls, and structural estimation approaches. To evaluate robustness to potential violations of quasi-random assignment, we estimate our model restricting to data from the 44 stations with quasi-random assignment selected in Section 4.2. To assess robustness to our risk-adjustment procedure, we also estimate our model with moments that omit patient characteristics X_i from the risk-adjustment procedure, and we estimate the model omitting the risk-adjustment step altogether, plugging raw counts (n_j^d, n_j^m, n_j) directly into the likelihood. To address potentially endogenous return ED visits, we restrict our sample to only heavy VA users. To address potentially endogenous second diagnoses, we restrict false negatives to cases of pneumonia that required inpatient admission.

Finally, we consider sensitivity to alternative assumptions. First, we estimate an alternative model that allows for flexible correlation ρ. While λ and ρ are separately identified in the data, they are difficult to separately estimate, so we fix ρ = 0 in the baseline model.34 In the alternative approach, we fix λ = 0.026 and allow for flexible ρ. Second, we consider alternative values for κ and report results in Appendix Table A.10.

Our main qualitative findings are robust across all of these alternative approaches. Both reduced-form moments and estimated structural parameters are qualitatively unchanged.
As a result, our decompositions of variation into skill and preferences, discussed in Section 6, are also unchanged.

5.5 Heterogeneity

To provide suggestive evidence on what may drive variation in skill and preferences, we project our empirical Bayes posterior means for (α_j, β_j) onto observed radiologist characteristics. Figure A.15 shows the distribution of observed characteristics across bins defined by empirical Bayes posterior means of skill α_j. Appendix Figure A.16 shows analogous results for the preference parameter β_j.

As shown in Figure A.15, higher-skilled radiologists are older and more experienced (Panel A).35 Higher-skilled radiologists also tend to read more chest X-rays as a share of the scans they read (Panel B). Interestingly, those who are more skilled spend more time generating their reports (Panel C), suggesting that skill may be a function of effort as well as of characteristics like training or talent. Radiologists with more skill also issue shorter rather than longer reports (Panel D), possibly pointing to clarity and efficiency of communication as a marker of skill. There is little correlation between skill and the rank of the medical school a radiologist attended (Panel E). Finally, higher-skilled radiologists are more likely to be male, in part reflecting the fact that male radiologists are older and tend to be more specialized in reading chest X-rays (Panel F). The results for the preference parameter β_j, in Appendix Figure A.16, tend to go in the opposite direction. This reflects the fact that our empirical Bayes estimates of α_j and β_j are slightly negatively correlated.

34 We do not have many points representing radiologists with many cases who have exactly FPR_j = 0. Points in (FPR_j, TPR_j) space with FPR_j ≈ 0 and TPR_j < 1 can be rationalized by λ > 0, a very negative ρ, or some combination of both. With infinite data, we should be able to separately estimate λ and ρ, but with finite data, it is difficult to fit both λ and ρ.
It is important to emphasize that large variation in characteristics remains, even conditional on skill or preference. This is broadly consistent with the physician practice-style and teacher value-added literatures, which demonstrate large variation in decisions and outcomes that appears uncorrelated with physician or teacher characteristics (Epstein and Nicholson 2009; Staiger and Rockoff 2010).

6 Policy Implications

6.1 Decomposing Observed Variation

To assess the relative importance of skill and preferences in driving observed decisions and outcomes, we simulate counterfactual distributions of decisions and outcomes in which we eliminate variation in skill or preferences separately. We first simulate model primitives (α_j, β_j) from the estimated parameters. We then eliminate variation in skill by imposing α_j = ᾱ, where ᾱ is the mean of α_j, while keeping β_j unchanged. Similarly, we eliminate variation in preferences by imposing β_j = β̄, where β̄ is the mean of β_j, while keeping α_j unchanged. For the baseline and counterfactual distributions of underlying primitives, (α_j, β_j), (ᾱ, β_j), and (α_j, β̄), we simulate a large number of observations per radiologist to approximate the shares P_j and FN_j for each radiologist.

35 These results are based on a model that allows underlying primitives to vary by radiologist j and age bin t (we group five years into an age bin), where within j, μ_α and μ_β each change linearly with t. We estimate a positive linear trend for μ_α and a slightly negative trend for μ_β. We find similar relationships when we assess radiologist tenure on the job and the log number of prior chest X-rays.

Eliminating variation in skill reduces variation in diagnosis rates by 39 percent and variation in miss rates by 78 percent.
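The decomposition exercise can be sketched in a few lines: draw a population of (α_j, β_j), compute each radiologist's implied diagnosis and miss rates at the optimal threshold, then recompute the cross-radiologist variances after fixing one primitive at its mean. All parameter values below are hypothetical stand-ins for the estimated ones, and the implied rates are computed analytically rather than by simulating patients:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(3)
nu_bar = 1.5   # illustrative latent cutoff

def tau_star(a, b):
    return (nu_bar - np.sqrt(1 - a**2) * norm.ppf(b / (1 + b))) / a

def rates(a, tau):
    """(diagnosis rate, miss rate) implied by skill a and threshold tau."""
    p_nd_neg = multivariate_normal([0.0, 0.0], [[1.0, a], [a, 1.0]]).cdf([tau, nu_bar])
    return 1 - norm.cdf(tau), norm.cdf(tau) - p_nd_neg

J = 300   # hypothetical population of radiologists
alpha = 0.5 * (1 + np.tanh(rng.normal(1.0, 0.4, J)))
beta = np.exp(rng.normal(1.9, 0.2, J))

def rate_variances(a_vec, b_vec):
    out = np.array([rates(a, tau_star(a, b)) for a, b in zip(a_vec, b_vec)])
    return out.var(axis=0)   # (variance of diagnosis rates, variance of miss rates)

v_base = rate_variances(alpha, beta)
v_no_skill = rate_variances(np.full(J, alpha.mean()), beta)   # alpha_j -> mean
v_no_pref = rate_variances(alpha, np.full(J, beta.mean()))    # beta_j -> mean

share_skill = 1 - v_no_skill[0] / v_base[0]   # diagnosis-rate variance attributed to skill
```

The magnitudes of the variance reductions depend entirely on the assumed distributions; the structure of the exercise is what carries over.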
On the other hand, eliminating variation in preferences reduces variation in diagnosis rates by 29 percent and has no significant effect on variation in miss rates.36 These decomposition results suggest that variation in skill can have first-order impacts on variation in decisions, something the standard model of preference-based selection rules out by assumption.

6.2 Policy Counterfactuals

We also evaluate the welfare implications of policies aimed at observed variation in decisions or at underlying skill. Welfare depends on the overall false positive share FP and the overall false negative share FN. We denote these objects under the status quo as FP⁰ and FN⁰, respectively. We then define an index of welfare relative to the status quo:

W = 1 − (FP + β° FN) / (FP⁰ + β° FN⁰),    (10)

where β° is the social planner's relative welfare loss due to false negatives compared to false positives. This index ranges from W = 0 at the status quo to W = 1 at the first best of FP = FN = 0. It is also possible that W < 0 under a counterfactual policy that reduces welfare relative to the status quo. We estimate FP⁰ and FN⁰ based on our model estimates as

FP⁰ = (1 / Σ_j n_j) Σ_j n_j FP(α_j, τ*(α_j, β_j; ν̄); ν̄),
FN⁰ = (1 / Σ_j n_j) Σ_j n_j FN(α_j, τ*(α_j, β_j; ν̄); ν̄).

Here, τ*(α, β; ν̄) denotes the optimal threshold given evaluation skill α, preference β, and the disease prevalence implied by ν̄. We simulate a set of 10,000 radiologists, each characterized by (α_j, β_j), from the estimated hyperparameters. We then consider welfare under counterfactual policies that eliminate diagnostic variation by imposing diagnostic thresholds on radiologists.

In Table II, we evaluate outcomes under two sets of counterfactual policies. Counterfactuals 1 and 2 focus on thresholds, while Counterfactuals 3 to 6 aim to improve skill.

36 Panel B of Appendix Table A.8 shows these baseline results and standard errors, as well as corresponding results under alternative specifications described in Section 5.4.
Appendix Figure A.17 shows the implications for variation in diagnosis rates and in miss rates under a range of reductions in variation in skill or in preferences.

Counterfactual 1 imposes a fixed diagnostic threshold τ̄ chosen to maximize welfare:

τ̄(β°) = argmax_τ { 1 − [(1 / Σ_j n_j) Σ_j n_j (FP(α_j, τ; ν̄) + β° FN(α_j, τ; ν̄))] / (FP⁰ + β° FN⁰) },

where ν̄ and the simulated set of α_j are derived from our baseline model in Section 5. Despite the objective of maximizing welfare, a fixed diagnostic threshold may actually reduce welfare relative to the status quo by imposing this constraint. On the other hand, Counterfactual 2 allows diagnostic thresholds to vary as a function of α_j, implementing τ_j(β°) = τ*(α_j, β°; ν̄). This policy should weakly increase welfare and outperform Counterfactual 1.

In Counterfactuals 3 to 6, we consider alternative policies that improve diagnostic skill, for example by training radiologists, selecting radiologists with higher skill, or aggregating signals so that decisions use better information. In Counterfactuals 3 to 5, we allow radiologists to choose their own diagnostic thresholds, but we improve the skill α_j of all radiologists at the bottom of the distribution to a minimum level. For example, in Counterfactual 3, we improve skill to the 25th percentile α^25, setting α_j = α^25 for any radiologist below this level. The optimal thresholds are then τ_j = τ*(max(α_j, α^25), β_j; ν̄). Counterfactual 6 forms random two-radiologist teams and aggregates the signals of each team member under the assumption that the two signals are drawn independently.37

Table II shows outcomes and welfare under β° = 6.71, matching the mean radiologist preference β_j. We find that imposing a fixed diagnostic threshold (Counterfactual 1) would actually reduce welfare. Although this policy reduces aggregate false positives, it increases aggregate false negatives, which are costlier.
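The comparison between the two threshold policies can be sketched as follows. The population of (α_j, β_j) is hypothetical, a grid search stands in for the argmax over the fixed threshold, and the welfare index follows Equation (10):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(4)
nu_bar, beta_soc = 1.5, 6.71   # illustrative cutoff; social weight at the mean preference
J = 150                        # small hypothetical population for speed

def tau_star(a, b):
    return (nu_bar - np.sqrt(1 - a**2) * norm.ppf(b / (1 + b))) / a

def fp_fn(a, tau):
    """False positive and false negative shares for skill a and threshold tau."""
    S = 1 - norm.cdf(nu_bar)                                  # disease prevalence
    p_joint = multivariate_normal([0.0, 0.0], [[1.0, a], [a, 1.0]]).cdf([tau, nu_bar])
    fn = norm.cdf(tau) - p_joint                              # Pr(w <= tau, nu > nu_bar)
    fp = (1 - norm.cdf(tau)) - (S - fn)                       # Pr(w > tau, nu <= nu_bar)
    return fp, fn

alpha = 0.5 * (1 + np.tanh(rng.normal(1.0, 0.4, J)))
beta = np.exp(rng.normal(1.9, 0.2, J))

def loss(taus):
    fps, fns = zip(*(fp_fn(a, t) for a, t in zip(alpha, taus)))
    return np.mean(fps) + beta_soc * np.mean(fns)

loss_sq = loss(tau_star(alpha, beta))                       # status quo: FP0 + b FN0
# Counterfactual 1: one fixed threshold for everyone, best value on a grid
loss_fix = min(loss(np.full(J, t)) for t in np.linspace(0.6, 2.0, 11))
# Counterfactual 2: thresholds vary optimally with skill
loss_sk = loss(tau_star(alpha, np.full(J, beta_soc)))

W_fixed = 1 - loss_fix / loss_sq            # welfare index, Equation (10)
W_skill_threshold = 1 - loss_sk / loss_sq
```

Because Counterfactual 2 sets each threshold pointwise optimally under β°, it weakly dominates any fixed threshold; whether the fixed threshold falls below the status quo (W_fixed < 0), as in the paper's estimates, depends on the assumed distributions.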
Imposing a threshold that varies optimally with skill (Counterfactual 2) must improve welfare, but we find that the magnitude of this gain is small. In contrast, improving diagnostic skill reduces both false negatives and false positives and substantially outperforms threshold-based policies. Combining two radiologists' signals (Counterfactual 6) improves welfare by 35 percent of the difference between the status quo and the first best. Counterfactual policies that improve radiologist skill naturally reclassify a much higher number of cases than policies that simply change diagnostic thresholds, since improving skill reorders signals, while changing thresholds leaves signals unchanged.

Table II also shows aggregate rates of diagnosis and "reclassification," counting changes in classification (i.e., diagnosed or not) between the status quo and the counterfactual policy. Under all of the policies we consider, the numbers of reclassified cases are greater, sometimes dramatically so, than the net changes in the numbers of diagnosed cases.

37 In practice, the signals of radiologists working in the same location may be subject to correlated noise. In this sense, we view this counterfactual as an upper bound on the information gained from combining signals.

Figure A.18 shows welfare changes as a function of the social planner's preference β°. In this figure, we consider Counterfactuals 1 and 3 from Table II. We also show the welfare gain a planner would expect if she set a fixed threshold under the incorrect assumption that radiologists have uniform diagnostic skill. In this calculation, we assume that the planner posits a common diagnostic skill parameter ᾱ that rationalizes FP⁰ and FN⁰ with some estimate of disease prevalence ν̄′. In this "mistaken policy counterfactual," the planner would conclude that a fixed threshold would modestly increase welfare.
In the range of β° spanning radiologist preferences from the 10th to 90th percentiles (Table I and Appendix Figure A.13), the skill policy outperforms the threshold policy, regardless of the policy-maker's belief about the heterogeneity of skill. The threshold policy only outperforms the skill policy when β° diverges significantly from radiologist preferences. For example, if β° = 0, the optimal policy is trivial: No patient should be diagnosed with pneumonia. In this case, there is no gain from improving skill, but there is a large gain from imposing a fixed threshold, since radiologists' preferences deviate widely from the social planner's preferences.

6.3 Discussion

We show that the dimensions of "skill" and "preferences" have different implications for welfare and policy. Each of these dimensions likely captures a range of underlying factors. In our framework, "skill" captures the relationship between a patient's underlying state and a radiologist's signals about the state. We attribute this mapping to the radiologist, since quasi-random assignment to radiologists implies that we are isolating the causal effect of radiologists. As suggested by the evidence in Section 5.5, "skill" may reflect not only underlying ability but also effort. Furthermore, in this setting, radiologists may form their judgments with the aid of other clinicians (e.g., residents, fellows, non-radiologist clinicians) and must communicate their judgments to other physicians. Skill may therefore reflect not only the quality of signals that the radiologist observes directly, but also the quality of signals that she (or her team) passes on to other clinicians.

What we call "preferences" encompasses any distortion from the optimal threshold implied by (i) the social planner's relative disutility of false negatives, β°, and (ii) each radiologist's skill, α_j. These distortions may arise from intrinsic preferences or from external incentives that cause radiologist β_j to differ from β°.
Alternatively, as we elaborate in Appendix G.2, equivalent distortions may arise from radiologists having incorrect beliefs about their own skill α_j.

For purposes of welfare analysis, the mechanisms underlying "preferences" or "skill" do not matter insofar as they map to an optimal diagnostic threshold and deviations from it. However, practical policy implications (e.g., whether we train radiologists to read chest X-rays, collaborate with others, or communicate with others) will depend on institution-specific mechanisms.

7 Conclusion

In this paper, we decompose the roots of practice variation in decisions across radiologists into dimensions of skill and preferences. The standard view in much of the literature is to assume that such practice variation in many settings results from variation in preferences. We first show descriptive evidence that runs counter to this view: Radiologists who diagnose more cases with a disease are also the ones who miss more cases that actually have the disease. We then apply a framework of classification and a model of decisions that depend on both diagnostic skill and preferences. Using this framework, we demonstrate that the source of variation in decisions can have important implications for how policymakers should view the efficiency of variation and for the ideal policies to address such variation. In our case, variation in skill accounts for 39 percent of the variation in diagnostic decisions, and policies that improve skill result in potentially large welfare improvements, while policies that impose uniform diagnosis rates may reduce welfare.

Our approach may be applied to settings with the following conditions: (i) quasi-random assignment of cases to decision-makers; (ii) an objective to match decisions to underlying states; and (iii) signals of a case's underlying state that may be observable to the analyst under at least one of the decisions. Many settings of interest may meet these criteria.
For example, physicians aim to match diagnostic and treatment decisions to each patient's underlying disease state (Abaluck et al. 2016; Mullainathan and Obermeyer 2019). Judges aim to match bail decisions to whether a defendant will recidivate (Kleinberg et al. 2018). Under these conditions, this framework can be used to decompose observed variation in decisions and outcomes into policy-relevant measures of skill and preferences.

Our framework also contributes to an active and growing judges-design literature that uses variation across decision-makers to estimate the effect of a decision on outcomes (e.g., Kling 2006). In this setting, we demonstrate a practical test of monotonicity revealed by miss rates (i.e., Λ ∈ [−1, 0]), drawing on intuition delineated previously in the case of binary instruments (Kitagawa 2015; Balke and Pearl 1997). This generalizes to testing whether cases that suggest an underlying state relevant for classification, such as subsequent diagnoses, appellate court decisions (Norris 2019), or discovery of contraband (Feigenberg and Miller 2020), have proper density (i.e., Pr(s_i = 1) ∈ [0, 1]) among compliers. We show that, while such tests may be stronger than those typically used in the judges-design literature, they nevertheless correspond to a weaker monotonicity assumption that intuitively relates treatment propensities to skill and implies the "average monotonicity" concept of Frandsen et al. (2019). The behavioral foundation of our empirical framework also provides a way to think about when the validity of the judges design may be at risk due to monotonicity violations. Diagnostic skill may be particularly important to account for when agents require expertise to match decisions to underlying states, when this expertise likely varies across agents, and when the costs of false negatives and false positives are highly asymmetric.
When all three of these conditions are met, we may have a priori reason to expect correlations between diagnostic skill and propensities, potentially casting doubt on the validity of the standard judges design. Our work suggests further testing to address this doubt. Finally, since the judges design relies on comparisons between agents of the same skill, our approach to measuring skill may provide a path for future research designs that correct for bias due to monotonicity violations by conditioning on skill. In Appendix G.4, we run a Monte Carlo simulation as a proof of concept for this possibility.

STANFORD UNIVERSITY, DEPARTMENT OF VETERANS AFFAIRS, AND NATIONAL BUREAU OF ECONOMIC RESEARCH
STANFORD UNIVERSITY AND NATIONAL BUREAU OF ECONOMIC RESEARCH
STANFORD UNIVERSITY

References

ABALUCK, J., L. AGHA, C. KABRHEL, A. RAJA, AND A. VENKATESH (2016): "The Determinants of Productivity in Medical Testing: Intensity and Allocation of Care," American Economic Review, 106, 3730-3764.
ABUJUDEH, H. H., G. W. BOLAND, R. KAEWLAI, P. RABINER, E. F. HALPERN, G. S. GAZELLE, AND J. H. THRALL (2010): "Abdominal and Pelvic Computed Tomography (CT) Interpretation: Discrepancy Rates Among Experienced Radiologists," European Radiology, 20, 1952-1957.
ANGRIST, J. D., G. W. IMBENS, AND A. B. KRUEGER (1999): "Jackknife Instrumental Variables Estimation," Journal of Applied Econometrics, 14, 57-67.
ANWAR, S. AND H. FANG (2006): "An Alternative Test of Racial Prejudice in Motor Vehicle Searches: Theory and Evidence," American Economic Review, 96, 127-151.
ARNOLD, D., W. DOBBIE, AND C. S. YANG (2018): "Racial Bias in Bail Decisions," Quarterly Journal of Economics, 133, 1885-1932.
ARNOLD, D., W. S. DOBBIE, AND P. HULL (2020): "Measuring Racial Discrimination in Bail Decisions," Working Paper 26999, National Bureau of Economic Research.
BALKE, A. AND J.
PEARL (1997): "Bounds on Treatment Effects from Studies with Imperfect Compliance," Journal of the American Statistical Association, 92, 1171-1176.
BERTRAND, M. AND A. SCHOAR (2003): "Managing with Style: The Effect of Managers on Firm Policies," Quarterly Journal of Economics, 118, 1169-1208.
BHULLER, M., G. B. DAHL, K. V. LOKEN, AND M. MOGSTAD (2020): "Incarceration, Recidivism, and Employment," Journal of Political Economy, 128, 1269-1324.
BLACKWELL, D. (1953): "Equivalent Comparisons of Experiments," Annals of Mathematical Statistics, 24, 265-272.
CHAN, D. C. (2018): "The Efficiency of Slacking Off: Evidence from the Emergency Department," Econometrica, 86, 997-1030.
CHANDRA, A., D. CUTLER, AND Z. SONG (2011): "Who Ordered That? The Economics of Treatment Choices in Medical Care," in Handbook of Health Economics, Elsevier, vol. 2, 397-432.
CHANDRA, A. AND D. O. STAIGER (2007): "Productivity Spillovers in Healthcare: Evidence from the Treatment of Heart Attacks," Journal of Political Economy, 115, 103-140.
CHANDRA, A. AND D. O. STAIGER (2020): "Identifying Sources of Inefficiency in Health Care," Quarterly Journal of Economics, 135, 785-843.
CURRIE, J. AND W. B. MACLEOD (2017): "Diagnosing Expertise: Human Capital, Decision Making, and Performance among Physicians," Journal of Labor Economics, 35, 1-43.
DOBBIE, W., J. GOLDIN, AND C. S. YANG (2018): "The Effects of Pretrial Detention on Conviction, Future Crime, and Employment: Evidence from Randomly Assigned Judges," American Economic Review, 108, 201-240.
DOYLE, J. J., S. M. EWER, AND T. H. WAGNER (2010): "Returns to Physician Human Capital: Evidence from Patients Randomized to Physician Teams," Journal of Health Economics, 29, 866-882.
DOYLE, J. J., J. A. GRAVES, J. GRUBER, AND S. KLEINER (2015): "Measuring Returns to Hospital Care: Evidence from Ambulance Referral Patterns," Journal of Political Economy, 123, 170-214.
EPSTEIN, A. J. AND S.
NICHOLSON (2009): "The Formation and Evolution of Physician Treatment Styles: An Application to Cesarean Sections," Journal of Health Economics, 28, 1126-1140.
FABRE, C., M. PROISY, C. CHAPUIS, S. JOUNEAU, P. A. LENTZ, C. MEUNIER, G. MAHE, AND M. LEDERLIN (2018): "Radiology Residents' Skill Level in Chest X-Ray Reading," Diagnostic and Interventional Imaging, 99, 361-370.
FEIGENBERG, B. AND C. MILLER (2020): "Racial Disparities in Motor Vehicle Searches Cannot Be Justified by Efficiency," Working Paper 27761, National Bureau of Economic Research.
FIGLIO, D. N. AND M. E. LUCAS (2004): "Do High Grading Standards Affect Student Performance?" Journal of Public Economics, 88, 1815-1834.
FILE, T. M. AND T. J. MARRIE (2010): "Burden of Community-Acquired Pneumonia in North American Adults," Postgraduate Medicine, 122, 130-141.
FISHER, E. S., D. E. WENNBERG, T. A. STUKEL, D. J. GOTTLIEB, F. L. LUCAS, AND E. L. PINDER (2003a): "The Implications of Regional Variations in Medicare Spending. Part 1: The Content, Quality, and Accessibility of Care," Annals of Internal Medicine, 138, 273-287.
FISHER, E. S., D. E. WENNBERG, T. A. STUKEL, D. J. GOTTLIEB, F. L. LUCAS, AND E. L. PINDER (2003b): "The Implications of Regional Variations in Medicare Spending. Part 2: Health Outcomes and Satisfaction with Care," Annals of Internal Medicine, 138, 288-298.
FRANDSEN, B. R., L. J. LEFGREN, AND E. C. LESLIE (2019): "Judging Judge Fixed Effects," Working Paper 25528, National Bureau of Economic Research.
FRANKEL, A. (2021): "Selecting Applicants," Econometrica, 89, 615-645.
FRIEDMAN, J. H. (2001): "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics, 29, 1189-1232.
GARBER, A. M. AND J. SKINNER (2008): "Is American Health Care Uniquely Inefficient?" Journal of Economic Perspectives, 22, 27-50.
GOWRISANKARAN, G., K. JOINER, AND P.-T. LEGER (2017): "Physician Practice Style and Healthcare Costs: Evidence from Emergency Departments," Working Paper 24155, National Bureau of Economic Research.
HECKMAN, J. J. AND E.
VYTLACIL (2005): "Structural Equations, Treatment Effects, and Econometric Policy Evaluation," Econometrica, 73, 669-738.
HEISS, F. AND V. WINSCHEL (2008): "Likelihood Approximation by Numerical Integration on Sparse Grids," Journal of Econometrics, 144, 62-80.
HOFFMAN, M., L. B. KAHN, AND D. LI (2018): "Discretion in Hiring," Quarterly Journal of Economics, 133, 765-800.
IMBENS, G. W. AND J. D. ANGRIST (1994): "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467-475.
IMBENS, G. W. AND D. B. RUBIN (1997): "Estimating Outcome Distributions for Compliers in Instrumental Variables Models," Review of Economic Studies, 64, 555-574.
INSTITUTE OF MEDICINE (2013): Variation in Health Care Spending: Target Decision Making, Not Geography, National Academies Press.
INSTITUTE OF MEDICINE (2015): Improving Diagnosis in Health Care, National Academies Press.
KITAGAWA, T. (2015): "A Test for Instrument Validity," Econometrica, 83, 2043-2063.
KLEINBERG, J., H. LAKKARAJU, J. LESKOVEC, J. LUDWIG, AND S. MULLAINATHAN (2018): "Human Decisions and Machine Predictions," Quarterly Journal of Economics, 133, 237-293.
KLING, J. R. (2006): "Incarceration Length, Employment, and Earnings," American Economic Review, 96, 863-876.
KUNG, H.-C., D. L. HOYERT, J. XU, AND S. L. MURPHY (2008): "Deaths: Final Data for 2005," National Vital Statistics Reports: From the Centers for Disease Control and Prevention, National Center for Health Statistics, National Vital Statistics System, 56, 1-120.
LEAPE, L. L., T. A. BRENNAN, N. LAIRD, A. G. LAWTHERS, A. R. LOCALIO, B. A. BARNES, L. HEBERT, J. P. NEWHOUSE, P. C. WEILER, AND H. HIATT (1991): "The Nature of Adverse Events in Hospitalized Patients," New England Journal of Medicine, 324, 377-384.
MACHADO, C., A. M. SHAIKH, AND E. J. VYTLACIL (2019): "Instrumental Variables and the Sign of the Average Treatment Effect," Journal of Econometrics, 212, 522-555.
MOLITOR, D.
(2017): "The Evolution of Physician Practice Styles: Evidence from Cardiologist Migration," American Economic Journal: Economic Policy, 10, 326-356.
MOURIFIE, I. AND Y. WAN (2017): "Testing Local Average Treatment Effect Assumptions," Review of Economics and Statistics, 99, 305-313.
MULLAINATHAN, S. AND Z. OBERMEYER (2019): "A Machine Learning Approach to Low-Value Health Care: Wasted Tests, Missed Heart Attacks and Mis-Predictions," Working Paper 26168, National Bureau of Economic Research.
NORRIS, S. (2019): "Examiner Inconsistency: Evidence from Refugee Appeals," Working Paper 2018-75, University of Chicago, Becker Friedman Institute of Economics.
RIBERS, M. A. AND H. ULLRICH (2019): "Battling Antibiotic Resistance: Can Machine Learning Improve Prescribing?" DIW Berlin Discussion Paper 1803.
RUBIN, D. B. (1974): "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies," Journal of Educational Psychology, 66, 688-701.
RUUSKANEN, O., E. LAHTI, L. C. JENNINGS, AND D. R. MURDOCH (2011): "Viral Pneumonia," Lancet, 377, 1264-1275.
SELF, W. H., D. M. COURTNEY, C. D. MCNAUGHTON, R. G. WUNDERINK, AND J. A. KLINE (2013): "High Discordance of Chest X-Ray and Computed Tomography for Detection of Pulmonary Opacities in ED Patients: Implications for Diagnosing Pneumonia," American Journal of Emergency Medicine, 31, 401-405.
SHOJANIA, K. G., E. C. BURTON, K. M. MCDONALD, AND L. GOLDMAN (2003): "Changes in Rates of Autopsy-Detected Diagnostic Errors Over Time: A Systematic Review," JAMA, 289, 2849-2856.
SILVER, D. (2020): "Haste or Waste? Peer Pressure and Productivity in the Emergency Department," Working Paper, Princeton University, Princeton, NJ.
STAIGER, D. O. AND J. E. ROCKOFF (2010): "Searching for Effective Teachers with Imperfect Information," Journal of Economic Perspectives, 24, 97-118.
STERN, S. AND M.
TRAJTENBERG (1998): "Empirical Implications of Physician Authority in Pharmaceutical Decisionmaking," Working Paper 6851, National Bureau of Economic Research.
THOMAS, E. J., D. M. STUDDERT, H. R. BURSTIN, E. J. ORAV, T. ZEENA, E. J. WILLIAMS, K. M. HOWARD, P. C. WEILER, AND T. A. BRENNAN (2000): "Incidence and Types of Adverse Events and Negligent Care in Utah and Colorado," Medical Care, 38, 261-271.
VAN PARYS, J. AND J. SKINNER (2016): "Physician Practice Style Variation: Implications for Policy," JAMA Internal Medicine, 176, 1549-1550.
VYTLACIL, E. (2002): "Independence, Monotonicity, and Latent Index Models: An Equivalence Result," Econometrica, 70, 331-341.

Figure I Visualizing the Classification Problem
Note: Panel A shows the standard classification matrix representing four joint outcomes depending on decisions and states. Each row represents a decision and each column represents a state. Panel B plots examples of receiver operating characteristic (ROC) curves, showing the relationship between the true positive rate (TPR) and the false positive rate (FPR). The particular ROC curves shown in this figure are formed assuming the signal structure in Equation (5), with more accurate ROC curves (higher α) farther from the 45-degree line.

Figure II Hypothetical Data Generated by Variation in Preferences vs.
Skill
Note: This figure shows two distributions of hypothetical data in ROC space. The top panel fixes skill and varies preferences: all agents are located on the same ROC curve and face the tradeoff between sensitivity (TPR) and specificity (1 − FPR). The bottom panel fixes preferences and varies evaluation skill: agents are located on different ROC curves but have parallel indifference curves.

Figure III Example Chest X-rays
Note: This figure shows example chest X-rays reproduced from Figure 2 of Fabre et al. (2018). These chest X-rays represent cases on which there is expert consensus and which are used for training radiologists. Only Panel E represents a case of infectious pneumonia, and we add a red oval to denote where the pneumonia lies, in the right lower lobe. Panel A shows miliary tuberculosis; Panel B shows a lung nodule (cancer) in the left upper lobe; Panel C shows usual interstitial pneumonitis; Panel D shows left upper lobe atelectasis; Panel F shows right upper lobe atelectasis.
Figure IV Covariate Balance
Note: This figure shows coefficients and 95% confidence intervals from regressions of diagnosis status d_i (left column, "Diagnosis") or the assigned radiologist's leave-out diagnosis propensity Z_i (middle and right columns, "Leave-Out," defined in Equation (4)) on covariates X_i, controlling for time-station interactions T_i. The 66 covariates are the variables listed in Appendix A.2, less the 11 variables that are indicators for missing values. The left and middle panels use the full sample of stations; the right panel uses 44 stations with balance on age, defined in Section 4.2. The outcome variables are multiplied by 100. Continuous covariates are standardized to have standard deviations equal to 1. For readability, a few coefficients (and their standard errors) are divided by 10, as indicated by "/10" in the covariate labels. The joint F-tests of all covariates are F = 608.20 (p = 0.000), F = 2.28 (p = 0.000), and F = 1.40 (p = 0.015) for the three columns, respectively.
At the bottom of each panel, we report the F-statistic and p-value from the joint F-test of all covariates.

Figure V Projecting Data on ROC Space
Note: This figure plots the true positive rate (TPR_j) and false positive rate (FPR_j) for each of the 3,199 radiologists in our sample who have at least 100 chest X-rays. The figure is based on observed risk-adjusted diagnosis and miss rates P̂_j^obs and F̂N_j^obs, then adjusted for the share of X-rays not at risk for pneumonia (κ̂ = 0.336) and the share of cases in which pneumonia first manifests after the initial visit (λ̂ = 0.026). The values of TPR_j and FPR_j are then computed using the estimated prevalence rate Ŝ = 0.051. Values are truncated to impose TPR_j ≤ 1 (affects 597 observations), FPR_j ≥ 0 (affects 44 observations), and TPR_j ≥ FPR_j (affects 68 observations). See Section 4.3 and Appendix C for more details.

Figure VI Diagnosis and Miss Rates
Note: This figure plots the relationship between miss rates and diagnosis rates across radiologists, using the leave-out diagnosis propensity instrument Z_i defined in Equation (4). We first estimate the first-stage regression of diagnosis d_i on Z_i, controlling for covariates X_i and minimal controls T_i. We then plot a binned scatter of the indicator of a false negative m_i against the fitted first-stage values, residualizing both with respect to X_i and T_i and recentering both to their respective sample means. The estimates shown in the panels are: Panel A, Coeff = 0.291 (0.031), N = 4,663,840, J = 3,199; Panel B, Coeff = 0.344 (0.062), N = 1,464,642, J = 1,094. Panel A shows results for the full sample.
Panel B shows results in the subsample comprising 44 stations with balance on age, as defined in Section 4.2. The coefficient in each panel corresponds to the 2SLS estimate for the corresponding IV regression; each panel also reports the number of cases (N) and the number of radiologists (J). The standard error is clustered at the radiologist level and shown in parentheses.

Figure VII Balance on Predicted False Negative
Note: This figure plots the relationship between radiologist diagnosis rates and predicted false negatives of patients assigned to radiologists, using the leave-out diagnosis propensity instrument Z_i. Plots are generated analogously to those in Figure VI, except that the false negative indicator m_i is replaced by the predicted value m̂_i from a regression of m_i on X_i alone, and controls X_i are omitted. Panel A shows results for the full sample (Coeff = 0.096 (0.006), N = 4,663,840, J = 3,199). Panel B shows results in the subsample comprising 44 stations with balance on age, as defined in Section 4.2 (Coeff = 0.019 (0.008), N = 1,464,642, J = 1,094). The coefficient in each panel corresponds to the 2SLS estimate for the corresponding IV regression; each panel also reports the number of cases (N) and the number of radiologists (J). The standard error is clustered at the radiologist level and shown in parentheses.
Figure VIII Stability of Slope between Diagnosis and Miss Rates
Note: This figure shows the stability of the IV estimate of Figure VI as we vary the set of patient characteristics used as controls. We divide the 77 variables in X_i into 10 subsets, as described in Section 4.4, and re-run the IV regression of Figure VI using each of the 2^10 = 1,024 different combinations of the subsets in place of X_i. The x-axis reports the number of subsets. The y-axis shows the average slope as a solid line and the minimum and maximum slopes as dashed lines. Panel A shows results in the full sample of stations; Panel B shows results in the subsample comprising 44 stations with balance on age, as defined in Section 4.2.

Figure IX Optimal Diagnostic Threshold
Note: This figure shows how the optimal diagnostic threshold varies as a function of skill α and preferences β, with iso-preference curves for β ∈ {5, 7, 9}. Each iso-preference curve illustrates how the optimal diagnostic threshold varies with evaluation skill for a fixed preference, given by Equation (7), using ν = 1.635 estimated from the model. Dots on the figure represent the empirical Bayes posterior means of α (on the x-axis) and τ (on the y-axis) for each radiologist. The empirical Bayes posterior means are the same as those shown in Appendix Figure A.13. Details on the empirical Bayes procedure are given in Appendix E.3.
Table I Structural Estimation Results

Panel A: Model Parameter Estimates
Parameter   Estimate          Description
μ_α         0.945 (0.219)     Mean of ᾱ_j, α_j = ½(1 + tanh ᾱ_j)
σ_α         0.296 (0.029)     Standard deviation of ᾱ_j
μ_β         1.895 (0.249)     Mean of β̄_j, β_j = exp β̄_j
σ_β         0.136 (0.044)     Standard deviation of β̄_j
λ           0.026 (0.001)     Share of at-risk negatives developing subsequent pneumonia
ν           1.635 (0.091)     Prevalence S = 1 − Φ(ν)
κ           0.336             Share not at risk for pneumonia

Panel B: Radiologist Posterior Means
                              Percentiles
     Mean            10th            25th            75th            90th
α    0.855 (0.050)   0.756 (0.079)   0.816 (0.065)   0.908 (0.035)   0.934 (0.025)
β    6.713 (1.694)   5.596 (1.608)   6.071 (1.659)   7.284 (1.750)   7.909 (1.780)
τ    1.252 (0.006)   1.165 (0.009)   1.208 (0.006)   1.298 (0.008)   1.336 (0.012)

Note: This table shows model parameter estimates (Panel A) and moments in the implied distribution of empirical Bayes posterior means across radiologists (Panel B). μ_α and σ_α determine the distribution of radiologist diagnostic skill α, and μ_β and σ_β determine the distribution of radiologist preferences β (the disutility of a false negative relative to a false positive). We assume that α and β are uncorrelated. λ is the proportion of at-risk chest X-rays with no radiographic pneumonia at the time of exam but subsequent development of pneumonia. ν describes the prevalence of pneumonia at the time of the exam among at-risk chest X-rays. κ is the proportion of chest X-rays not at risk for pneumonia; it is calibrated as the proportion of patients with predicted probability of pneumonia less than 0.01 from a random forest model of pneumonia based on rich characteristics in the patient chart. Parameters are described in further detail in Sections 5.1 and 5.2. The method to calculate empirical Bayes posterior means is described in Appendix E.3. Standard errors, shown in parentheses, are computed by block bootstrap, with replacement, at the radiologist level.
Table II Counterfactual Policies
Rows: Status quo; 1. Fixed threshold; 2. Threshold as function of skill; 3. Improve skill to 25th percentile; 4. Improve skill to 50th percentile; 5. Improve skill to 75th percentile; 6. Combine two signals; 7. First best. Columns: Welfare; False Negative; False Positive; Diagnosed; Reclassified.
Note: This table shows outcomes and welfare under the status quo and counterfactual policies, further described in Section 6. The first row shows outcomes and welfare under the status quo; subsequent rows show outcomes and welfare under counterfactual policies. Counterfactuals 1 and 2 impose diagnostic thresholds: Counterfactual 1 imposes a fixed diagnosis rate for all radiologists, and Counterfactual 2 imposes diagnosis rates as a function of diagnostic skill. Counterfactuals 3 to 5 improve diagnostic skill to the 25th, 50th, and 75th percentiles, respectively. Counterfactual 6 allows two radiologists to diagnose a single patient and combine the (assumed) independent signals they receive. Welfare is normalized to 0 for the status quo and 1 for the first best of no false negative or false positive outcomes. Numbers of cases that are false negatives, false positives, diagnosed, and reclassified are all divided by the prevalence of pneumonia. Reclassified cases are those with a classification (i.e., diagnosed or not) that is different under the counterfactual policy than under the status quo. Standard errors, shown in parentheses, are computed by block bootstrap, with replacement, at the radiologist level.

Online Appendix for
"Selection with Variation in Diagnostic Skill: Evidence from Radiologists"
David C. Chan, Matthew Gentzkow, Chuan Yu
September 2021

A Monotonicity Conditions . . . A.2
B Identification of Preferences . . . A.3
C Mapping Data to ROC Space . . . A.4
D Tests of Monotonicity . . . A.6
E Details of Structural Analysis . . . A.8
  E.1 Optimal Diagnostic Thresholds . . . A.9
  E.2 Simulated Maximum Likelihood Estimation . . . A.13
  E.3 Empirical Bayes Posterior Means . . . A.14
F Robustness . . . A.14
G Extensions . . . A.17
  G.1 General Loss for False Negatives . . . A.17
  G.2 Incorrect Beliefs . . . A.23
  G.3 Simulation of Linear Risk Adjustment . . . A.24
  G.4 Controlling for Radiologist Skill . . . A.25

A Monotonicity Conditions

We begin with the covariance object of interest under average monotonicity of Frandsen et al. (2019) (Condition 2). For a given case i and set of agents J, define

\Psi_{i,J} = \sum_{j \in J} p_j (P_j - \bar{P}) (d_{ij} - \bar{d}_i),

where p_j is the share of cases assigned to agent j, \bar{P} = \sum_j p_j P_j is the p-weighted average treatment propensity, and \bar{d}_i = \sum_j p_j d_{ij} is the p-weighted average potential treatment of case i. To consider probabilistic monotonicity (Condition 3), which allows d_{ij} to be random, we consider the probability limit of \Psi_{i,J} over random draws of d_{ij}, as the number of draws grows large:

\bar{\Psi}_{i,J} = \sum_{j \in J} p_j (P_j - \bar{P}) (\Pr(d_{ij} = 1) - E[\bar{d}_i]),

where E[\bar{d}_i] = \sum_{j \in J} p_j \Pr(d_{ij} = 1).

Proposition A.1. Probabilistic monotonicity (Condition 3) in some set of agents J implies \bar{\Psi}_{i,J} \ge 0 for all i.

Proof. Under probabilistic monotonicity, for any j and j′, P_j > P_{j′} implies that Pr(d_{ij} = 1) ≥ Pr(d_{ij′} = 1) for all i.
Thus, any (p-weighted) covariance between P_j and Pr(d_{ij} = 1) must be weakly positive for all i, in any set of agents J where probabilistic monotonicity holds. \bar{\Psi}_{i,J} is in fact the p-weighted covariance between P_j and Pr(d_{ij} = 1) for a given i, so \bar{\Psi}_{i,J} \ge 0 for all i. ∎

To analyze the implications of skill-propensity independence (Condition 4), we define the limit as the number of agents grows large. We assume that when the set of agents is J, the skill α_j, diagnosis rate P_j, an assignment weight ς_j such that p_j = ς_j / \sum_{j' \in J} ς_{j'}, and any other decision-relevant characteristics of each agent j ∈ J are drawn independently from a distribution H. For a case i, let G denote the distribution of (α_{j(i)}, P_{j(i)}), incorporating the uncertainty from both the draws from H and the assignment process. Skill-propensity independence (Condition 4) implies that α_{j(i)} and P_{j(i)} are independent under G. We let π_i(α, p) denote the probability that the case is diagnosed conditional on the assigned agent's skill α and diagnosis rate p, and π_i(p) denote the probability conditional only on p. Probabilistic monotonicity (Condition 3) implies that π_i(α, p) is increasing in p. Let \Psi_i denote the probability limit of \bar{\Psi}_{i,J} as the number of agents in J grows large.

Proposition A.2. Skill-propensity independence (Condition 4) implies \Psi_i \ge 0 for all i.

Proof. Note that under skill-propensity independence we can write G(α, p) = G_α(α) G_P(p), where G_α and G_P are the marginal distributions of α and p. By the law of large numbers, the probability limit \Psi_i is the expectation under the joint distribution G:

\Psi_i = E_G\left[(p - \bar{P})(\pi_i(\alpha, p) - \bar{d}_i)\right].

Moreover,

E_G\left[(p - \bar{P})(\pi_i(\alpha, p) - \bar{d}_i)\right] = \int (p - \bar{P})\, \pi_i(\alpha, p)\, dG(\alpha, p) = \int (p - \bar{P})\, \pi_i(p)\, dG_P(p) \ge 0.

The first equality uses the fact that E_G[(p - \bar{P})\, \bar{d}_i] = 0, the second equality uses skill-propensity independence, and the final inequality uses \bar{P} = E_G[P_j] and the fact that π_i(α, p) increasing in p implies π_i(p) increasing in p.
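As a concrete illustration (a hypothetical sketch, not the paper's code), \bar{\Psi}_{i,J} is just a p-weighted covariance and can be computed directly; probabilistic monotonicity implies a weakly positive value for every case:

```python
def psi_bar(p, P, prob_d):
    """p-weighted covariance between agent propensities P_j and the
    case-specific diagnosis probabilities Pr(d_ij = 1)."""
    total = sum(p)
    p = [w / total for w in p]                      # normalize weights p_j
    P_bar = sum(w * Pj for w, Pj in zip(p, P))      # weighted mean propensity
    d_bar = sum(w * q for w, q in zip(p, prob_d))   # E[d_bar_i]
    return sum(w * (Pj - P_bar) * (q - d_bar)
               for w, Pj, q in zip(p, P, prob_d))

# Monotone case: Pr(d_ij = 1) rises with P_j, so psi_bar >= 0.
print(psi_bar([1, 1, 1], [0.2, 0.5, 0.8], [0.1, 0.4, 0.9]))  # ~0.08
# Violation: the highest-propensity agent is least likely to treat case i,
# so the weighted covariance turns negative.
print(psi_bar([1, 1, 1], [0.2, 0.5, 0.8], [0.9, 0.4, 0.1]))  # ~-0.08
```

The second call illustrates how a negative value for some case i flags a monotonicity violation.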
∎

B Identification of Preferences

Proposition B.3. If the posterior probability of s_i = 1 is continuously increasing in w_{ij} for any signal, ROC curves must be smooth and concave.

Proof. Without loss of generality, consider a uniform signal w ∼ U(0, 1). Then under the threshold rule noted in Section 2.1, P_j = 1 − τ_j. Furthermore,

TPR_j = \frac{1}{S} \int_{1-P_j}^{1} \Pr(s = 1 \mid w, \alpha_j)\, dw;
FPR_j = \frac{1}{1-S} \int_{1-P_j}^{1} \left(1 - \Pr(s = 1 \mid w, \alpha_j)\right) dw.

This implies a slope in ROC space of \frac{1-S}{S} \cdot \frac{\Pr(s = 1 \mid 1 - P_j, \alpha_j)}{1 - \Pr(s = 1 \mid 1 - P_j, \alpha_j)} at P_j, which is decreasing in P_j if Pr(s = 1 | w, α_j) is increasing in w. ∎

Proposition B.4. Knowing the cost of a false negative relative to a false positive, β_j = \frac{u_j(1,1) - u_j(0,1)}{u_j(0,0) - u_j(1,0)} ∈ (0, ∞), is sufficient to identify the function u_j(·, ·) up to normalizations.

Proof. The agent's expected utility gain from choosing d = 1 rather than d = 0 is

E[u(1, s) - u(0, s) \mid w, \alpha] = [u(1,1) - u(0,1)] \Pr(s = 1 \mid w, \alpha) + [u(1,0) - u(0,0)] \Pr(s = 0 \mid w, \alpha).

The optimal decision is thus d = 1 if and only if

\frac{u(1,1) - u(0,1)}{u(0,0) - u(1,0)} \ge \frac{\Pr(s = 0 \mid w, \alpha)}{\Pr(s = 1 \mid w, \alpha)}.

This decision rule depends on u_j only through β_j, so knowing β_j identifies u_j up to normalizations. ∎

C Mapping Data to ROC Space

In this appendix, we detail parameters that map the observed data on diagnoses (d_i) and false negatives (m_i) to the key objects of the true positive rate (TPR_j) and the false positive rate (FPR_j) for each radiologist j in ROC space. As discussed in Section 4.1, this mapping requires a parameter for the prevalence of pneumonia, S = 1 − Φ(ν). Under quasi-random assignment, this prevalence is (conditionally) the same across radiologists. In addition, we allow for two additional parameters to address practical concerns. First, some chest X-rays are ordered for reasons completely unrelated to pneumonia (e.g., rib fractures). We thus consider a proportion κ of cases that are not at risk for pneumonia and are recognized as such by all radiologists. Second, we do not observe false negatives immediately at the same time that the chest X-ray is read.
So we allow for a share λ of undiagnosed cases that do not have pneumonia at the time of the exam but develop it and are diagnosed subsequently, thus being incorrectly observed as false negatives.

We begin with the observed radiologist-specific diagnosis and miss rates P_j^obs and FN_j^obs, which are population values of the estimates P̂_j^obs and F̂N_j^obs defined in the main text. They relate to the true shares FN_j, TN_j, FP_j, and TP_j as follows:

P_j^{obs} = (1 - κ)(TP_j + FP_j) = (1 - κ) P_j;   (C.1)
FN_j^{obs} = (1 - κ)(FN_j + λ TN_j).   (C.2)

Using Equations (C.1) and (C.2) and the fact that TN_j = 1 − P_j − FN_j, we derive

FN_j = \frac{λ P_j^{obs} + FN_j^{obs}}{(1-κ)(1-λ)} - \frac{λ}{1-λ}.   (C.3)

We can derive the remaining shares by using TN_j = 1 − P_j − FN_j, TP_j = S − FN_j, and FP_j = P_j − TP_j:

TN_j = \frac{1}{1-λ} - \frac{P_j^{obs} + FN_j^{obs}}{(1-κ)(1-λ)};
TP_j = S - \frac{λ P_j^{obs} + FN_j^{obs}}{(1-κ)(1-λ)} + \frac{λ}{1-λ};
FP_j = \frac{P_j^{obs} + FN_j^{obs}}{(1-κ)(1-λ)} - \frac{λ}{1-λ} - S.

The underlying true positive and false positive rates are thus

TPR_j = \frac{TP_j}{TP_j + FN_j} = 1 - \frac{1}{S}\left[\frac{λ P_j^{obs} + FN_j^{obs}}{(1-κ)(1-λ)} - \frac{λ}{1-λ}\right];
FPR_j = \frac{FP_j}{FP_j + TN_j} = \frac{1}{1-S}\left[\frac{P_j^{obs} + FN_j^{obs}}{(1-κ)(1-λ)} - \frac{λ}{1-λ} - S\right].

Conditional on S, κ, and λ, we can thus transform data for a given radiologist in reduced-form space to the relevant radiologist-specific rates in ROC space: (P_j^{obs}, FN_j^{obs}) ↦ (FPR_j, TPR_j).

In Figure V, we show the implied (FPR_j, TPR_j) based on (P̂_j^obs, F̂N_j^obs) and model estimates of S, κ, and λ. This figure does not account for the fact that (P̂_j^obs, F̂N_j^obs) are measured in finite sample, and we simply impose that TPR_j ≤ 1, FPR_j ≥ 0, and TPR_j ≥ FPR_j, sequentially. The first step (TPR_j ≤ 1) truncates 597 of 3,199 radiologists (18.7 percent of radiologists), mainly those whose observed miss rate F̂N_j^obs is smaller than λ. The second step (FPR_j ≥ 0) truncates 44 radiologists. The third step (TPR_j ≥ FPR_j) truncates 68 radiologists. In Appendix Figure A.14, we plot empirical Bayes posterior means of (FPR_j, TPR_j) based on (P̂_j^obs, F̂N_j^obs) and all estimated model parameters.
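The mapping defined by Equations (C.1) and (C.3), together with the truncation used for Figure V, can be sketched as follows (an illustrative implementation, not the paper's replication code; the input rates for the example radiologist are hypothetical, while S, κ, and λ are the point estimates reported in Table I):

```python
def to_roc(p_obs, fn_obs, S, kappa, lam):
    """Map observed diagnosis and miss rates (P_obs, FN_obs) to
    (FPR, TPR) via Equations (C.1) and (C.3)."""
    P = p_obs / (1 - kappa)                               # invert Eq. (C.1)
    FN = ((lam * p_obs + fn_obs) / ((1 - kappa) * (1 - lam))
          - lam / (1 - lam))                              # Eq. (C.3)
    TP = S - FN
    FP = P - TP
    tpr, fpr = TP / S, FP / (1 - S)
    tpr = min(tpr, 1.0)   # truncation: TPR <= 1
    fpr = max(fpr, 0.0)   # truncation: FPR >= 0
    tpr = max(tpr, fpr)   # truncation: TPR >= FPR
    return fpr, tpr

# Point estimates from Table I; hypothetical radiologist rates.
S, kappa, lam = 0.051, 0.336, 0.026
fpr, tpr = to_roc(p_obs=0.07, fn_obs=0.02, S=S, kappa=kappa, lam=lam)
print(round(fpr, 3), round(tpr, 3))
```

The sequential truncation mirrors the finite-sample adjustment described for Figure V.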
While ROC-space radiologist rates depend on S, κ, and λ, it is important to note that two key findings are invariant to these parameters. First, Figure VI and Appendix Figure A.9 imply an upward-sloping relationship between P_j^obs and FN_j^obs. By Equations (C.1) and (C.3), we can see that this violates the prediction that Λ ∈ [−1, 0], based on P_j and FN_j. Specifically, comparing two radiologists j and j′, Equations (C.1) and (C.3) imply that

\frac{FN_j^{obs} - FN_{j'}^{obs}}{P_j^{obs} - P_{j'}^{obs}} = (1-λ)\, \frac{FN_j - FN_{j'}}{P_j - P_{j'}} - λ \in [-1, -λ].

So the coefficient estimand Λ^obs > 0 from a regression of FN_j^obs on P_j^obs implies that Λ > 0 for any λ ∈ [0, 1). Second, by Remark 2, an upward-sloping relationship between P_j and FN_j contradicts uniform skill regardless of S. Therefore, regardless of S, the pattern of (FPR_j, TPR_j) across radiologists in ROC space, as in Figure V, should remain downward-sloping and inconsistent with the assumption of uniform skill.¹

To illustrate the second point, we show in Appendix Figure A.6 that the pattern of (FPR_j, TPR_j) across radiologists remains inconsistent with uniform skill at lower and upper bounds for S. To construct these bounds, we first divide all radiologists into ten bins based on their diagnosed shares P_j. For each bin q, we set a lower bound for S at the weighted-average (underlying) miss rate,

\underline{S}_q = \overline{FN}_q = \frac{\sum_{j \in J_q} n_j FN_j}{\sum_{j \in J_q} n_j},

where J_q is the set of agents in bin q. In other words, we assume that all diagnoses are false positives. We set an upper bound for S at the weighted-average sum of the (underlying) miss rate and diagnosis rate,

\overline{S}_q = \overline{FN}_q + \overline{P}_q = \frac{\sum_{j \in J_q} n_j (FN_j + P_j)}{\sum_{j \in J_q} n_j}.

Finally, we take the intersection of these bounds from all bins as the bounds in the full sample, which gives us

¹Consider two agents j and j′. Let ΔTPR = TPR_j − TPR_{j′}, ΔFPR = FPR_j − FPR_{j′}, ΔP = P_j − P_{j′}, and ΔFN = FN_j − FN_{j′}. It is easy to show that ΔTPR = −(1/S) ΔFN and ΔFPR = (1/(1−S))(ΔP + ΔFN). So ΔTPR/ΔFPR = −\frac{1-S}{S} \cdot \frac{ΔFN}{ΔP + ΔFN}.
The condition that 455" AFI N €(-1,0) is equivalent to the condition that ALPR > 0, as long as S € (0, 1). A.5 S = max} <q<10S, = 0.015 and S = mini <g<i0 Sg = 0.073. Further, as we discuss in Section 4.4, our overall results remain robust to alternative values for x. As shown in Appendix Table A.10, model parameters are stable and suggest wide variation in diagnostic skill. Model implications for reducing variation by uniform preferences or uniform skill similarly remain robust. D_ Tests of Monotonicity Under the standard monotonicity assumption (Condition 1(iii)), when comparing a radiologist j' who diagnoses more cases than radiologist j, there cannot be a case i such that dj; = 1 and dj; = 0. In this appendix, we conduct informal tests of this assumption that are standard in the judges-design literature, along the lines of tests in Bhuller et al. (2020) and Dobbie et al. (2018). These monotonicity tests confirm whether the first-stage estimates are non-negative in subsamples of cases. We first present results of implementing these standard tests. We then draw relationships between these tests, which do not reject monotonicity, and our analysis in Section 4, which strongly rejects monotonicity. Results We define subsamples of cases based on patient characteristics. We consider four characteristics: probability of diagnosis (based on patient characteristics), age, arrival time, and race. We define two subsamples for each of the characteristics, for a total of eight subsamples: (i) above-median age, (ii) below-median age, (iii) above-median probability of diagnosis, (iv) below-median probability of diagnosis, (v) arrival time during the day (between 7 a.m. and 7 p.m.), (vi) arrival time at night (between 7 p.m. and 7 a.m.), (vii) white race, and (viii) non-white race. 
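The eight subsamples can be constructed from patient-level arrays in a few lines. The input names (`age`, `p_diagnosis`, `arrival_hour`, `white`) are hypothetical stand-ins for the analysis variables, not the paper's data schema.

```python
import numpy as np

def make_subsamples(age, p_diagnosis, arrival_hour, white):
    """Return the eight subsample masks used in the monotonicity tests.

    Inputs are 1-D arrays: patient age, predicted probability of diagnosis,
    hour of arrival (0-23), and a white-race indicator. The encoding is an
    illustrative assumption.
    """
    day = (arrival_hour >= 7) & (arrival_hour < 19)   # 7 a.m. to 7 p.m.
    return {
        "above_median_age": age > np.median(age),
        "below_median_age": age <= np.median(age),
        "above_median_p": p_diagnosis > np.median(p_diagnosis),
        "below_median_p": p_diagnosis <= np.median(p_diagnosis),
        "day_arrival": day,
        "night_arrival": ~day,
        "white": white.astype(bool),
        "non_white": ~white.astype(bool),
    }
```

Each pair of masks partitions the sample, so each of the eight first-stage regressions runs on a well-defined subsample.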
The first testable implication follows from the following intuition: Under monotonicity, a radiologist who generally increases the probability of diagnosis should increase the probability of diagnosis in any subsample of cases. Following the judges-design literature, we construct leave-out propensities for pneumonia diagnosis and use these propensities as instruments for whether an index case is diagnosed with pneumonia, as in Equation (4). In each of the eight subsamples indexed by r, we estimate the following first-stage regression, using observations in subsample I_r:

d_i = α_r Z_{j(i)} + X_i' γ_r + T_i' η_r + ε_i.  (D.4)

Consistent with our quasi-experiment in Assumption 1, we control for time categories interacted with station identities, or T_i. We also control for patient characteristics X_i, as in our baseline first-stage regression. Under monotonicity, we should have α_r ≥ 0 for all r.

The second testable implication is slightly stronger: Under monotonicity, an increase in the probability of diagnosis by changing radiologists in any subsample of patients should correspond to increases in the probability of diagnosis in all other subsamples of patients. To capture this intuition, we construct "reverse-sample" instruments that exclude any case in subsample r:

Z_{j(i)}^{-r} = (1/|I_{j(i)} \ I_r|) Σ_{i'∈I_{j(i)}\I_r} d_{i'},

where I_j denotes the set of cases read by radiologist j. We estimate the first-stage regression, using observations in subsample I_r:

d_i = α_r^{-r} Z_{j(i)}^{-r} + X_i' γ_r + T_i' η_r + ε_i.  (D.5)

As before, we control for patient characteristics X_i and time categories interacted with station dummies T_i, and we check whether α_r^{-r} ≥ 0 for all r.

In Appendix Table A.6, we show results for these informal monotonicity tests, based on Equations (D.4) and (D.5). Panel A shows results corresponding to the standard leave-out instrument, or α_r from Equation (D.4). Panel B shows results corresponding to the reverse-sample instrument, or α_r^{-r} from Equation (D.5). Each column corresponds to a different subsample.
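The two instruments can be constructed as follows. This is a minimal sketch assuming arrays `d` (diagnosis indicator) and `rad_id` (radiologist identifier); the function names are ours.

```python
import numpy as np

def leave_out_instrument(d, rad_id):
    """Leave-out diagnosis propensity: a radiologist's mean decision over all
    of her other cases, used as the instrument Z_j(i) in Equation (D.4)."""
    d = np.asarray(d, dtype=float)
    z = np.empty_like(d)
    for j in np.unique(rad_id):
        idx = np.where(rad_id == j)[0]
        tot = d[idx].sum()
        z[idx] = (tot - d[idx]) / (len(idx) - 1)  # exclude the index case
    return z

def reverse_sample_instrument(d, rad_id, in_subsample):
    """Reverse-sample propensity for Equation (D.5): mean decision over the
    radiologist's cases outside subsample r, so every case in the subsample
    (including the index case) is excluded."""
    d = np.asarray(d, dtype=float)
    z = np.empty_like(d)
    for j in np.unique(rad_id):
        idx = np.where(rad_id == j)[0]
        out = idx[~in_subsample[idx]]
        z[idx] = d[out].mean() if len(out) else np.nan
    return z
```

The reverse-sample instrument is evaluated only for observations inside subsample r, so it never uses the index case or any case in that subsample.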
All 16 regressions yield strongly positive first-stage coefficients.

Relationship with Reduced-Form Analysis

At a high level, the informal tests of monotonicity in the judges-design literature use information about observable case characteristics and treatment decisions, while our analysis in Section 4 exploits additional information about outcomes tied to an underlying state that is relevant for the classification decision. In this subsection, we clarify the relationship between these analyses.

We begin with the standard condition for IV validity, Condition 1. Following Imbens and Angrist (1994), we abstract from covariates, assuming unconditional random assignment in Condition 1(i), and consider a discrete multivalued instrument Z_i. In the judges design, the instrument can be thought of as the agent's treatment propensity, or Z_i = P_{j(i)} ∈ {p_1, p_2, ..., p_K}, which the leave-out instrument approaches with infinite data. We assume that p_1 < p_2 < ... < p_K. We also introduce the notation d_i(Z_i) ∈ {0, 1} to denote potential treatment decisions as a function of the instrument; in our main framework, this amounts to d_{ij} = d_i(p) for all j such that P_j = p.

Now consider some binary characteristic x_i ∈ {0, 1}. We first note that the following Wald estimand between two consecutive values p_k and p_{k+1} of the instrument characterizes the probability that x_i = 1 among compliers i such that d_i(p_{k+1}) > d_i(p_k):

(E[x_i d_i | Z_i = p_{k+1}] - E[x_i d_i | Z_i = p_k]) / (E[d_i | Z_i = p_{k+1}] - E[d_i | Z_i = p_k]) = E[x_i | d_i(p_{k+1}) > d_i(p_k)].

Since x_i is binary, this Wald estimand gives us Pr(x_i = 1 | d_i(p_{k+1}) > d_i(p_k)) ∈ [0, 1]. Under Imbens and Angrist (1994), 2SLS with x_i d_i as an "outcome variable," instrumenting d_i with all values of Z_i, will give us a weighted average of the Wald estimands over k ∈ {1, ..., K-1}. Specifically, consider the following equations:

x_i d_i = Λ* d_i + u_i*;  (D.6)
d_i = α* Z_i + v_i*.  (D.7)

The 2SLS estimator of Λ* in this system of equations converges to a weighted average,

Λ* = Σ_{k=1}^{K-1} ω_k Pr(x_i = 1 | d_i(p_{k+1}) > d_i(p_k)),

where the weights ω_k are positive and sum to 1. Therefore, we would expect that Λ* ∈ [0, 1].

The informal monotonicity tests we conducted above ask whether some weighted average of Pr(d_i(p_{k+1}) > d_i(p_k) | x_i = 1) is greater than 0. Since Pr(x_i = 1) > 0 and Pr(d_i(p_{k+1}) > d_i(p_k)) > 0, the two conditions, Pr(d_i(p_{k+1}) > d_i(p_k) | x_i = 1) > 0 and Pr(x_i = 1 | d_i(p_{k+1}) > d_i(p_k)) > 0, are equivalent. Therefore, if we were to estimate Equations (D.6) and (D.7) by 2SLS, we would in essence be evaluating the same implication as the informal monotonicity tests standard in the literature.

In contrast, in a stylized representation of Section 4, we perform 2SLS on the following equations:

m_i = Λ d_i + u_i;  (D.8)
d_i = α Z_i + v_i.  (D.9)

Recall that m_i = 1(d_i = 0, s_i = 1) = s_i (1 - d_i). Following the same reasoning as above, we can state the estimand Λ as

Λ = - Σ_{k=1}^{K-1} ω_k Pr(s_i = 1 | d_i(p_{k+1}) > d_i(p_k)),

which is a negative weighted average of conditional probabilities. This yields the same prediction that we stated in Remark 3 (i.e., Λ ∈ [-1, 0]). As we discuss in Section 2.3, weaker conditions of monotonicity would leave this prediction unchanged.

More generally, we could apply the same reasoning to any binary potential outcome y_i(d) ∈ {0, 1} under treatment choice d ∈ {0, 1}. It is straightforward to show that, if we replace m_i with y_i d_i in Equation (D.8), the 2SLS system of Equations (D.8) and (D.9) would yield

Λ = Σ_{k=1}^{K-1} ω_k Pr(y_i(1) = 1 | d_i(p_{k+1}) > d_i(p_k)) ∈ [0, 1].

Alternatively, replacing m_i with -y_i (1 - d_i) in Equation (D.8) would imply

Λ = - Σ_{k=1}^{K-1} ω_k Pr(y_i(0) = 1 | d_i(p_{k+1}) > d_i(p_k)) ∈ [-1, 0].

How might we interpret our results together in Section 4 and in this appendix?
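As a numerical illustration of the two estimands above, the simulation below generates monotone decisions d_i(p) = 1(u_i < p) under random assignment and checks that the analogue of Λ* lies in [0, 1] while the analogue of Λ lies in [-1, 0]. The data-generating process is a stylized stand-in of our own construction, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p_vals = np.array([0.1, 0.2, 0.3, 0.4, 0.5])    # agent propensities p_1 < ... < p_K
Z = rng.choice(p_vals, size=n)                  # quasi-random agent assignment
u = rng.uniform(size=n)                         # latent resistance to treatment
d = (u < Z).astype(float)                       # monotone decisions d_i(p) = 1(u_i < p)
s = (rng.uniform(size=n) < np.where(u < 0.3, 0.7, 0.3)).astype(float)  # underlying state
x = (rng.uniform(size=n) < 0.5).astype(float)   # binary observable characteristic
m = s * (1.0 - d)                               # miss indicator m_i = s_i (1 - d_i)

def iv_slope(y, d, Z):
    # With a scalar multivalued instrument, cov(y, Z)/cov(d, Z) is a positively
    # weighted average of the adjacent Wald estimands described in the text
    return np.cov(y, Z)[0, 1] / np.cov(d, Z)[0, 1]

lam_star = iv_slope(x * d, d, Z)   # analogue of Equations (D.6)-(D.7)
lam = iv_slope(m, d, Z)            # analogue of Equations (D.8)-(D.9)
```

Here `lam_star` estimates a weighted average of Pr(x_i = 1) among compliers (about 0.5 by construction), and `lam` a negative weighted average of Pr(s_i = 1) among compliers.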
We show above that the informal monotonicity tests are necessary for demonstrating that binary observable characteristics have admissible probabilities (i.e., Pr(x_i = 1) ∈ [0, 1]) among compliers. On the other hand, our analysis in Section 4 strongly rejects that the key underlying state s_i has admissible probabilities among compliers. Specifically, our finding that Λ ∉ [-1, 0] is equivalent to showing that Pr(s_i = 1) ∉ [0, 1] among compliers, weighted by the probability that they contribute to the LATE. Observable characteristics may be correlated with s_i, but s_i is undoubtedly related to characteristics that are unobservable to the econometrician but, importantly, observable to radiologists. The importance of these unobservable characteristics drives the difference between our analysis and the standard informal tests for monotonicity.

If monotonicity violations are more likely to occur between cases based on an underlying state than between cases based on observable characteristics, as would be plausible in classification decisions with variation in skill, then an analysis based on the underlying state should yield a stronger test than an analysis based only on observable characteristics.

Finally, we note in Section 2.3 that our analysis in Section 4 is closely connected to the conceptual intuition for testing IV validity described in Kitagawa (2015). Kitagawa (2015) shows that with data on treatment d_i, outcome y_i, and instrument Z_i, the strongest testable implication of IV validity is that potential outcomes should have positive density among compliers. Kitagawa (2015) and Mourifie and Wan (2017) extend this intuition to settings in which we also have access to some observable characteristic x_i. In this case, the implication of IV validity can be strengthened to require potential outcomes to have positive density among compliers within each bin of x_i.
Thus, to implement a stronger test of IV validity (including monotonicity), we could undertake a similar test of Λ ∈ [-1, 0] using observations within each bin of x_i.

E  Details of Structural Analysis

E.1  Optimal Diagnostic Thresholds

We provide a derivation of the optimal diagnostic threshold, given by Equation (7) in Section 5.1. We start with a general expression for the joint distribution of the latent index for each patient, ν_i, and radiologist signals, w_ij. These determine each patient's true disease status and diagnosis status:

s_i = 1(ν_i > ν̄);
d_ij = 1(w_ij > τ_j).

We then form expectations of unconditional rates of false positives and false negatives, or FP_j = Pr(d_ij = 1, s_i = 0) and FN_j = Pr(d_ij = 0, s_i = 1), respectively. Denote the radiologist-specific joint distribution of (w_ij, ν_i) as f_j(x, y). Then

FN_j = Pr(w_ij < τ_j, ν_i > ν̄) = ∫_{-∞}^{τ_j} ∫_{ν̄}^{+∞} f_j(x, y) dy dx;
FP_j = Pr(w_ij > τ_j, ν_i < ν̄) = ∫_{τ_j}^{+∞} ∫_{-∞}^{ν̄} f_j(x, y) dy dx.

The joint distribution f_j(x, y) and ν̄ are known to the radiologist. Given her expected utility function in Equation (6), E[u_ij] = -(FP_j + β_j FN_j), where β_j is the disutility of a false negative relative to a false positive, the radiologist sets τ_j to maximize her expected utility. The first-order condition is

∂FP_j/∂τ_j + β_j ∂FN_j/∂τ_j = 0.

Denote the marginal density of w_ij as g_j, the conditional density of ν_i given w_ij as f_j(y|x) = f_j(x, y)/g_j(x), and the conditional cumulative distribution as F_j(y|x) = ∫_{-∞}^{y} f_j(t|x) dt. Then solving this first-order condition for the optimal threshold yields

∂FP_j/∂τ_j + β_j ∂FN_j/∂τ_j
= -∫_{-∞}^{ν̄} f_j(τ_j, y) dy + β_j ∫_{ν̄}^{+∞} f_j(τ_j, y) dy
= -F_j(ν̄|τ_j) g_j(τ_j) + β_j (1 - F_j(ν̄|τ_j)) g_j(τ_j) = 0.

The solution τ_j* to the first-order condition satisfies

F_j(ν̄|τ_j*) = β_j / (1 + β_j).  (E.10)

Equation (E.10) can alternatively be stated as

β_j = F_j(ν̄|τ_j*) / (1 - F_j(ν̄|τ_j*)).

This condition intuitively states that at the optimal threshold, the likelihood ratio of a false positive over a false negative is equal to the relative disutility of a false negative. As a special case, when (w_ij, ν_i) follows a joint-normal distribution, as in Equation (5), we know that ν_i | w_ij ~ N(α_j w_ij, 1 - α_j²), or (ν_i - α_j w_ij)/√(1-α_j²) | w_ij ~ N(0, 1). This implies that F_j(ν̄|τ_j*) = Φ((ν̄ - α_j τ_j*)/√(1-α_j²)). Plugging into Equation (E.10) and rearranging, we obtain Equation (7):

τ_j* = (1/α_j) (ν̄ - √(1-α_j²) Φ^{-1}(β_j/(1+β_j))).

Below we verify that ∂²E[u_ij]/∂τ_j² < 0 at τ_j* in a more general case, so τ_j* is the optimal threshold that maximizes expected utility.

Comparative Statics

Returning to the general case, we impose a monotone likelihood ratio property to ensure that Equation (E.10) implies a unique solution and to analyze comparative statics.

Assumption E.1 (Monotone Likelihood Ratio Property). The joint distribution f_j(x, y) satisfies

f_j(x_2, y_2)/f_j(x_2, y_1) > f_j(x_1, y_2)/f_j(x_1, y_1),  ∀ x_2 > x_1, y_2 > y_1.

We can rewrite the property using the conditional density:

f_j(y_2|x_2)/f_j(y_1|x_2) > f_j(y_2|x_1)/f_j(y_1|x_1),  ∀ x_2 > x_1, y_2 > y_1.

That is, the likelihood ratio f_j(y_2|x)/f_j(y_1|x), for y_2 > y_1, always increases with x. In the context of our model, when a higher signal w_ij is observed, the likelihood ratio of a higher ν_i over a lower ν_i is higher than when a lower w_ij is observed. Intuitively, this means that the signal a radiologist receives is informative of the patient's true condition. As a special case, if f_j(x, y) is a bivariate normal distribution, the monotone likelihood ratio property is equivalent to a positive correlation coefficient.

Assumption E.1 implies first-order stochastic dominance. Fixing x_2 > x_1 and considering any y_2 > y_1, Assumption E.1 implies

f_j(y_2|x_2) f_j(y_1|x_1) > f_j(y_2|x_1) f_j(y_1|x_2).  (E.11)

Integrating this expression with respect to y_1 from -∞ to y_2 yields

f_j(y_2|x_2) F_j(y_2|x_1) > f_j(y_2|x_1) F_j(y_2|x_2).

Rearranging, we have

f_j(y_2|x_2)/f_j(y_2|x_1) > F_j(y_2|x_2)/F_j(y_2|x_1),  ∀ y_2.

Similarly, integrating Equation (E.11) with respect to y_2 from y_1 to +∞ yields

f_j(y_1|x_1) (1 - F_j(y_1|x_2)) > f_j(y_1|x_2) (1 - F_j(y_1|x_1)).

Rearranging, we have

(1 - F_j(y_1|x_2))/(1 - F_j(y_1|x_1)) > f_j(y_1|x_2)/f_j(y_1|x_1),  ∀ y_1.

Combining the two inequalities, we have

F_j(y|x_1) > F_j(y|x_2),  ∀ y.  (E.12)

Under Equation (E.12), for a fixed ν̄, F_j(ν̄|τ_j) decreases with τ_j, i.e., ∂F_j(ν̄|τ_j)/∂τ_j < 0. We can now verify that

∂²E[u_ij]/∂τ_j² |_{τ_j=τ_j*} = (1+β_j) g_j(τ_j*) ∂F_j(ν̄|τ_j)/∂τ_j |_{τ_j=τ_j*} < 0.

Therefore, τ_j* represents an optimal threshold that maximizes expected utility.

Using Equation (E.12) and the Implicit Function Theorem, we can also derive two reasonable comparative-static properties of the optimal threshold. First, τ_j* decreases with β_j:

∂τ_j*/∂β_j = ((1+β_j)² ∂F_j(ν̄|τ_j)/∂τ_j |_{τ_j=τ_j*})^{-1} < 0.

Second, τ_j* increases with ν̄:

∂τ_j*/∂ν̄ = -f_j(ν̄|τ_j*) (∂F_j(ν̄|τ_j)/∂τ_j |_{τ_j=τ_j*})^{-1} > 0.

In other words, holding fixed the signal structure, a radiologist will increase her diagnosis rate when the relative disutility of false negatives increases and will decrease her diagnosis rate when pneumonia is less prevalent.

We next turn to analyzing the comparative statics of the optimal threshold with respect to skill. For a convenient specification with single-dimensional skill, we return to the specific case of joint-normal signals:

(w_ij, ν_i) ~ N((0, 0), (1, α_j; α_j, 1)).

Taking the derivative of the optimal threshold with respect to α_j in Equation (7), we have

∂τ_j*/∂α_j = (1/α_j²) (Φ^{-1}(β_j/(1+β_j))/√(1-α_j²) - ν̄).

These relationships yield the following observations. When α_j = 1, τ_j* = ν̄. When α_j = 0, the radiologist diagnoses no one if β_j < Φ(ν̄)/(1-Φ(ν̄)) (i.e., τ_j* = +∞), and the radiologist diagnoses everyone if β_j > Φ(ν̄)/(1-Φ(ν̄)) (i.e., τ_j* = -∞). When α_j ∈ (0, 1), the relationship between τ_j* and α_j depends on the prevalence parameter ν̄.
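Equation (7) and the first-order condition (E.10) can be checked numerically. The sketch below is self-contained (it builds the normal CDF from `math.erf` and inverts it by bisection; these helper names are ours) and verifies that the closed-form threshold maximizes E[u_ij] = -(FP_j + β_j FN_j), computed by direct quadrature over the signal.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(q):
    """Inverse normal CDF by bisection (illustrative helper)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def tau_star(alpha, beta, nu_bar):
    """Optimal threshold, Equation (7)."""
    return (nu_bar - math.sqrt(1.0 - alpha**2) * Phi_inv(beta / (1.0 + beta))) / alpha

def expected_utility(tau, alpha, beta, nu_bar, n_grid=4000):
    """E[u] = -(FP + beta*FN) under bivariate-normal signals, by quadrature over w."""
    lo, hi, fp, fn = -8.0, 8.0, 0.0, 0.0
    step = (hi - lo) / n_grid
    for k in range(n_grid):
        w = lo + (k + 0.5) * step
        g = math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)        # marginal of w
        F = Phi((nu_bar - alpha * w) / math.sqrt(1.0 - alpha**2))    # Pr(s_i = 0 | w)
        if w > tau:
            fp += F * g * step            # diagnosed but healthy
        else:
            fn += (1.0 - F) * g * step    # missed but sick
    return -(fp + beta * fn)
```

Evaluating `expected_utility` at `tau_star(alpha, beta, nu_bar)` and at perturbed thresholds confirms the second-order condition numerically.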
Generally, if β_j is greater than some upper threshold β̄, τ_j* will always increase with α_j; if β_j is less than some lower threshold β̲, τ_j* will always decrease with α_j; and if β_j ∈ (β̲, β̄) is between the lower and upper thresholds, τ_j* will first decrease and then increase with α_j. The thresholds for β_j depend on ν̄:

β̲ = min(Φ(ν̄)/(1-Φ(ν̄)), (1-Φ(ν̄))/Φ(ν̄));  β̄ = max(Φ(ν̄)/(1-Φ(ν̄)), (1-Φ(ν̄))/Φ(ν̄)).

The closer ν̄ is to 0, the less space there is between the thresholds. The range of β_j between the thresholds generally decreases as ν̄ decreases.

Intuitively, two forces drive the relationship between τ_j* and α_j. First, the threshold of radiologists with low skill depends on the overall prevalence of pneumonia. If pneumonia is uncommon, then radiologists with low skill will tend to diagnose fewer patients; if pneumonia is common, then radiologists with low skill will tend to diagnose more patients. Second, the threshold depends on the relative disutility of false negatives, β_j. If β_j is high enough, then radiologists with lower skill will tend to diagnose more patients with pneumonia. Depending on the size of β_j, this mechanism may not be enough to make τ_j* always increasing in α_j.

E.2  Simulated Maximum Likelihood Estimation

In Section 5.2, we estimate the hyperparameter vector θ, which governs the distribution of radiologist types γ_j, by maximum likelihood:

θ̂ = argmax_θ Σ_j log ∫ L_j(ñ_j^d, ñ_j^m, n_j | γ_j) f(γ_j | θ) dγ_j,

where (ñ_j^d, ñ_j^m, n_j) are radiologist j's risk-adjusted counts of diagnoses and false negatives and her case count. To calculate the radiologist-specific likelihood,

L_j(ñ_j^d, ñ_j^m, n_j | θ) = ∫ L_j(ñ_j^d, ñ_j^m, n_j | γ_j) f(γ_j | θ) dγ_j,

we need to evaluate the integral numerically. We approximate the integral using multi-dimensional sparse grids, as introduced in Heiss and Winschel (2008), which generate R nodes γ^r following the density f(γ_j | θ), given any hyperparameter vector θ. These nodes are chosen based on Gaussian quadratures and are assigned weights ω^r such that Σ_r ω^r = 1. We use a high accuracy level, which leads to R = 921 nodes in a two-dimensional integral.
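The node-and-weight approximation can be sketched as follows. For illustration we use a dense Gauss-Hermite product rule rather than the Heiss-Winschel sparse grid, and a deliberately simplified radiologist-level likelihood (independent binomial diagnosis and miss rates, logistic in a two-dimensional latent type); both simplifications are ours.

```python
import numpy as np

def lik(n_d, n_m, n, g1, g2):
    """Toy radiologist-level likelihood L_j(n_d, n_m, n | gamma): diagnosis
    and miss rates are logistic in the latent type gamma = (g1, g2), a
    hypothetical stand-in for the model's type-to-rates mapping."""
    p_d = 1.0 / (1.0 + np.exp(-g1))
    p_m = 1.0 / (1.0 + np.exp(-g2))
    return p_d**n_d * (1 - p_d)**(n - n_d) * p_m**n_m * (1 - p_m)**(n - n_m)

def quad_nodes(mu, sigma, deg=15):
    """Product Gauss-Hermite nodes/weights for gamma ~ N(mu, diag(sigma^2));
    weights sum to 1. (Sparse grids would reduce the node count.)"""
    x, w = np.polynomial.hermite.hermgauss(deg)
    w = w / np.sqrt(np.pi)
    g1, g2 = np.meshgrid(mu[0] + np.sqrt(2) * sigma[0] * x,
                         mu[1] + np.sqrt(2) * sigma[1] * x, indexing="ij")
    ww = np.outer(w, w)
    return g1.ravel(), g2.ravel(), ww.ravel()

def integrated_lik(n_d, n_m, n, mu, sigma, deg=15):
    """L_j(data | theta) approximated as sum_r w^r L_j(data | gamma^r)."""
    g1, g2, w = quad_nodes(mu, sigma, deg)
    return np.sum(w * lik(n_d, n_m, n, g1, g2))

def posterior_mean(n_d, n_m, n, mu, sigma, deg=15):
    """The same nodes also deliver likelihood-weighted posterior means of gamma."""
    g1, g2, w = quad_nodes(mu, sigma, deg)
    wl = w * lik(n_d, n_m, n, g1, g2)
    return np.array([np.sum(wl * g1), np.sum(wl * g2)]) / np.sum(wl)
```

Reusing one set of nodes for both the likelihood and the posterior means is what makes estimation and shrinkage cheap once the nodes are drawn.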
Then we take the weighted average across all nodes of the likelihood as an approximation of the integral:

L_j(ñ_j^d, ñ_j^m, n_j | θ) ≈ Σ_{r=1}^{R} ω^r L_j(ñ_j^d, ñ_j^m, n_j | γ^r).

The overall log-likelihood is then the sum across radiologists of the logs of these radiologist-specific likelihoods.

E.3  Empirical Bayes Posterior Means

After estimating θ̂, we want to find the empirical Bayes posterior mean γ̂_j for each radiologist j. Using Bayes' theorem, the empirical conditional posterior distribution of γ_j is

f(γ_j | ñ_j^d, ñ_j^m, n_j; θ̂) = f(ñ_j^d, ñ_j^m, n_j | γ_j) f(γ_j | θ̂) / ∫ f(ñ_j^d, ñ_j^m, n_j | γ_j) f(γ_j | θ̂) dγ_j,

where f(ñ_j^d, ñ_j^m, n_j | γ_j) is equivalent to L_j(ñ_j^d, ñ_j^m, n_j | γ_j); the denominator is then equivalent to the likelihood L_j(ñ_j^d, ñ_j^m, n_j | θ̂). The empirical Bayes predictions are the following posterior means:

γ̂_j = ∫ γ_j f(γ_j | ñ_j^d, ñ_j^m, n_j; θ̂) dγ_j = ∫ γ_j L_j(ñ_j^d, ñ_j^m, n_j | γ_j) f(γ_j | θ̂) dγ_j / L_j(ñ_j^d, ñ_j^m, n_j | θ̂).

As above, the integrals are evaluated numerically using sparse grids. We generate R nodes γ^r following the density f(γ | θ̂) and calculate the empirical Bayes posterior means as

γ̂_j ≈ Σ_{r=1}^{R} ω^r γ^r L_j(ñ_j^d, ñ_j^m, n_j | γ^r) / Σ_{r=1}^{R} ω^r L_j(ñ_j^d, ñ_j^m, n_j | γ^r).

F  Robustness

In this appendix, we discuss alternative empirical implementations of the baseline approach. Appendix Table A.8 presents results for the following empirical approaches:

1. Baseline. This column presents results for the baseline empirical approach. This approach uses observations from all stations; the sample selection procedure is given in Appendix Table A.1. We risk-adjust diagnosis and false negative status by 77 patient characteristic variables, described in Section 4.2, in addition to the controls for time dummies interacted with station dummies required for plausible quasi-random assignment in Assumption 1. We define a false negative as a case that was not diagnosed initially with pneumonia but returned within 10 days and was diagnosed at that time with pneumonia.

2. Balanced. This approach modifies the baseline approach by restricting to the 44 stations we select in Section 4.2 with stronger evidence for quasi-random assignment.
Risk-adjustment and the definition of a false negative are unchanged from baseline.

3. VA users. This approach restricts attention to a sample of veterans who use VA care more than non-VA care. We identify this sample among dual enrollees in Medicare and the VA, for whom we observe both VA and Medicare records of care inside and outside the VA, respectively. We count the number of outpatient, ED, and inpatient visits in the VA and in Medicare, and keep veterans who have more total visits in the VA than in Medicare. The risk-adjustment and outcome definition are unchanged from baseline.

4. Admission. This approach redefines a false negative to occur only among patients with a greater than 50 percent predicted probability of admission. Patients with a lower predicted probability of admission are all coded to have m_i = 0. The sample selection and risk adjustment are the same as in baseline.

5. Minimum controls. This approach controls only for time dummies interacted with station dummies, T_i, as specified by Assumption 1, without the 77 patient characteristic variables. The sample and outcome definition are unchanged from baseline.

6. No controls. This approach includes no controls. That is, we bypass the risk-adjustment procedure and use raw counts (n_j^d, n_j^m, n_j) in the likelihood, rather than the risk-adjusted counts (ñ_j^d, ñ_j^m, n_j).

7. Fix λ, flexible ρ. This approach allows for flexible estimation of ρ in the structural model (whereas we assume that ρ = 0 in the baseline structural model). Using results from our baseline estimation, we fix λ = 0.026 instead.

Rationale

Relative to the baseline approach, the "balanced" and "minimum controls" approaches respectively evaluate the importance of selecting stations with stronger evidence of quasi-random assignment and of controlling for rich patient observable characteristics. If results are robust under these approaches, then it is less likely that potential non-random assignment could be driving our results.
We evaluate results under the "VA users" approach in order to assess the potential threat that false negatives may be unobserved if patients fail to return to the VA. Although the process of returning to the VA is endogenous, it is only a concern under non-random assignment of patients to radiologists or under exclusion violations in which radiologists may influence the likelihood that a patient returns to the VA, separate from incurring a false negative. Veterans who predominantly use the VA relative to non-VA options are more likely to return to the VA for unresolved symptoms. Therefore, if results are robust under this approach, then exclusion violations and endogenous return visits are unlikely to explain our key findings.

Similarly, we assess an alternative definition of a false negative in the "admission" approach, requiring that patients be highly likely to be admitted as an inpatient based on their observed characteristics. Admitted patients have a built-in pathway for re-evaluation if signs and symptoms persist, worsen, or emerge; they need not decide to return to the VA. This approach also addresses a related threat that fellow ED radiologists may be more reluctant to contradict some radiologists than others, since admitted patients typically receive radiological evaluation from other divisions of radiology.

We take the "no controls" approach in order to assess the importance of linear risk-adjustment for our structural results. Although linear risk adjustment may be inconsistent with our nonlinear structural model, we expect that structural results should be qualitatively unchanged if risk-adjustment is relatively unimportant. In "fix λ, flexible ρ," we examine whether our structural model can rationalize the slight negative correlation between α_j and β_j implied by the data in Appendix Figure A.13.

Results

Appendix Table A.8 shows the robustness of key results under alternative implementations.
Panel A reports sample statistics and reduced-form moments. All empirical implementations result in large variation in diagnosis and miss rates across radiologists; standard deviations for both rates are weighted by the number of cases. The standard deviation of residual miss rates, after controlling for radiologist diagnosis rates, reveals that substantial heterogeneity in outcomes remains even after controlling for heterogeneity in decisions. This suggests violations, under all approaches, of the strict version of monotonicity in Condition 1(ii). Most importantly, the IV slope remains similarly positive across approaches. This suggests consistently strong violations of the weaker monotonicity conditions in Conditions 2-4.

Panel B of Appendix Table A.8 summarizes policy implications from decomposing variation into skill and preference components, as described in Section 6. In most implementations, more variation in diagnosis can be explained by heterogeneity in skill than by heterogeneity in preferences. An even larger proportion of the variation in false negatives can be explained by heterogeneity in skill; essentially none of the variation in false negatives can be explained by heterogeneity in preferences.

Appendix Table A.9 shows corresponding structural model results under each of these alternative implementations. Panel A reports parameter estimates, and Panel B reports moments of the distribution of (α_j, β_j) implied by the model parameters. The implementations again suggest qualitatively similar distributions of α_j, β_j, and τ_j.

G  Extensions

G.1  General Loss for False Negatives

Our baseline specification of utility in Equation (6) considers a fixed loss for any false negative relative to the loss for a false positive. In reality, some cases of pneumonia (e.g., those involving particularly virulent strains or vulnerable patients) may be much more costly to miss.
In this appendix, we show that implications are qualitatively unchanged under a more general model in which losses for false negatives may be higher for more severe cases. We consider the following utility function:

u_ij = -1 if d_ij = 1 and s_i = 0;
u_ij = -β_j h(ν_i) if d_ij = 0 and s_i = 1;
u_ij = 0 otherwise,

where h(ν_i) is bounded, differentiable, and weakly increasing in ν_i.² As before, s_i = 1(ν_i > ν̄), and β_j > 0. Without loss of generality, we assume h(ν̄) = 1, so that h(ν_i) ≥ 1 for all ν_i ≥ ν̄. Denote the conditional density of ν_i given w_ij as f_j(ν_i|w_ij) and the corresponding conditional cumulative distribution as F_j(ν_i|w_ij).

² The boundedness assumption ensures that the integrals below are well-defined; it is a sufficient condition but not necessary. The differentiability assumption simplifies calculation.

Expected utility, conditional on w_ij and d_ij = 0, is

E_{ν_i}[u_ij(ν_i, d_ij = 0) | w_ij] = -β_j E_{ν_i}[h(ν_i) 1(d_ij = 0, s_i = 1) | w_ij] = -β_j ∫_{ν̄}^{+∞} h(ν_i) f_j(ν_i|w_ij) dν_i.

The corresponding expectation when d_ij = 1 is

E_{ν_i}[u_ij(ν_i, d_ij = 1) | w_ij] = -Pr(s_i = 0, d_ij = 1 | w_ij) = -∫_{-∞}^{ν̄} f_j(ν_i|w_ij) dν_i = ∫_{ν̄}^{+∞} f_j(ν_i|w_ij) dν_i - 1.

The radiologist chooses d_ij = 1 if and only if E_{ν_i}[u_ij(ν_i, d_ij = 1) | w_ij] ≥ E_{ν_i}[u_ij(ν_i, d_ij = 0) | w_ij], or

∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|w_ij) dν_i ≥ 1.

If h(ν_i) = 1 for all ν_i, then this condition reduces to Pr(ν_i > ν̄ | w_ij) = 1 - F_j(ν̄|w_ij) ≥ 1/(1+β_j). In the general form, if the radiologist is indifferent between diagnosing and not diagnosing, we have

1 = ∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|w_ij) dν_i
  = ∫_{ν̄}^{+∞} (1 + β_j) f_j(ν_i|w_ij) dν_i + ∫_{ν̄}^{+∞} β_j (h(ν_i) - 1) f_j(ν_i|w_ij) dν_i
  ≥ (1 + β_j)(1 - F_j(ν̄|w_ij)),

as we assume h(ν_i) ≥ 1. Now the marginal patient may have a lower conditional probability of having pneumonia than in the case where h(ν_i) = 1 for all ν_i, as false negatives may be more costly. Define the optimal diagnosis rule as

d_j(w_ij) = 1(∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|w_ij) dν_i ≥ 1).

Proposition G.5 gives conditions under which the optimal diagnosis rule satisfies the threshold-crossing property.

Proposition G.5. Suppose the following two conditions hold:

1.
For any w'_ij > w_ij, the conditional distribution of ν_i given w'_ij first-order stochastically dominates (FOSD) the conditional distribution of ν_i given w_ij, i.e., F_j(ν_i|w'_ij) ≤ F_j(ν_i|w_ij), ∀ ν_i.

2. 0 < F_j(ν̄|w_ij) < 1 for all w_ij, with lim_{w_ij→-∞} F_j(ν̄|w_ij) = 1 and lim_{w_ij→+∞} F_j(ν̄|w_ij) = 0.

Then the optimal diagnosis rule satisfies the threshold-crossing property, i.e., for any radiologist j, there exists τ_j* such that

d_j(w_ij) = 0 if w_ij < τ_j*, and d_j(w_ij) = 1 if w_ij ≥ τ_j*.

We first prove the following lemma.

Lemma G.6. Suppose w'_ij > w_ij. If F_j(ν_i|w'_ij) ≤ F_j(ν_i|w_ij) for each ν_i, then d_j(w_ij) = 1 implies d_j(w'_ij) = 1.

Proof. Using integration by parts, we have

∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) (f_j(ν_i|w'_ij) - f_j(ν_i|w_ij)) dν_i
= [(1 + β_j h(ν_i)) (F_j(ν_i|w'_ij) - F_j(ν_i|w_ij))]_{ν̄}^{+∞} - ∫_{ν̄}^{+∞} β_j h'(ν_i) (F_j(ν_i|w'_ij) - F_j(ν_i|w_ij)) dν_i
= -(1 + β_j) (F_j(ν̄|w'_ij) - F_j(ν̄|w_ij)) - ∫_{ν̄}^{+∞} β_j h'(ν_i) (F_j(ν_i|w'_ij) - F_j(ν_i|w_ij)) dν_i
≥ 0,

since F_j(ν_i|w'_ij) ≤ F_j(ν_i|w_ij) for all ν_i, h(ν_i) is bounded, h(ν̄) = 1, and h'(ν_i) ≥ 0. □

We now proceed to the proof of Proposition G.5.

Proof. The second condition of Proposition G.5 ensures that

lim_{w_ij→-∞} ∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|w_ij) dν_i ≤ (1 + M β_j)(1 - lim_{w_ij→-∞} F_j(ν̄|w_ij)) = 0 < 1;
lim_{w_ij→+∞} ∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|w_ij) dν_i ≥ (1 + β_j)(1 - lim_{w_ij→+∞} F_j(ν̄|w_ij)) = 1 + β_j > 1,

where M = sup h(ν_i). So lim_{w_ij→-∞} d_j(w_ij) = 0 and lim_{w_ij→+∞} d_j(w_ij) = 1. Using Lemma G.6, the optimal diagnosis rule satisfies the threshold-crossing property. In particular, the optimal threshold τ_j* satisfies

∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|τ_j*) dν_i = 1. □

Proposition G.7. Suppose the conditions in Proposition G.5 hold and f_j is fixed. Then the optimal threshold τ_j* decreases with β_j. In particular, τ_j* → +∞ as β_j → 0⁺ and τ_j* → -∞ as β_j → +∞.

Proof. Consider radiologists j and j' with β_j > β_{j'}. Denote their optimal thresholds as τ_j* and τ_{j'}*, respectively. We have ∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|τ_j*) dν_i = 1 and

∫_{ν̄}^{+∞} (1 + β_{j'} h(ν_i)) f_j(ν_i|τ_j*) dν_i - ∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|τ_j*) dν_i = (β_{j'} - β_j) ∫_{ν̄}^{+∞} h(ν_i) f_j(ν_i|τ_j*) dν_i < 0.

So ∫_{ν̄}^{+∞} (1 + β_{j'} h(ν_i)) f_j(ν_i|τ_j*) dν_i < 1, or d_{j'}(τ_j*) = 0. By Proposition G.5, we know that τ_j* < τ_{j'}*. Since τ_j* decreases with β_j, if it were bounded below or above, it would have limits as β_j approaches +∞ or 0⁺.
We can confirm that this is not the case. For example, suppose τ_j* is bounded below as β_j → +∞, with limit τ̲. Take β_j > 1/(1 - F_j(ν̄|τ̲)). Then

∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|τ_j*) dν_i ≥ (1 + 1/(1 - F_j(ν̄|τ̲))) (1 - F_j(ν̄|τ_j*))
≥ (1 + 1/(1 - F_j(ν̄|τ̲))) (1 - F_j(ν̄|τ̲)) = 2 - F_j(ν̄|τ̲).

The second inequality holds since τ_j* ≥ τ̲. Taking the limit, we have

lim_{β_j→+∞} ∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) f_j(ν_i|τ_j*) dν_i ≥ 2 - F_j(ν̄|τ̲) > 1.

This is a contradiction, since the integral equals 1 at the optimal threshold, so τ_j* is not bounded below. Similarly, we can show that τ_j* is not bounded above. □

From now on, we assume w_ij and ν_i follow a bivariate normal distribution:

(w_ij, ν_i) ~ N((0, 0), (1, α_j; α_j, 1)).

Conditional on observing w_ij, the true signal ν_i follows a normal distribution N(α_j w_ij, 1 - α_j²). So

F_j(ν_i|w_ij) = Φ((ν_i - α_j w_ij)/√(1-α_j²)),

where Φ(·) is the CDF of the standard normal distribution.

Corollary G.8. Suppose w_ij and ν_i follow the bivariate normal distribution specified above. Then if α_j > 0, the optimal diagnosis rule satisfies the threshold-crossing property.

Proof. When w_ij and ν_i follow the bivariate normal distribution with correlation coefficient α_j, we have F_j(ν_i|w_ij) = Φ((ν_i - α_j w_ij)/√(1-α_j²)). It is easy to verify that the two conditions in Proposition G.5 hold if α_j > 0. The optimal threshold τ_j* = τ_j(α_j, β_j; h(·)) is defined by

∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) (1/√(1-α_j²)) φ((ν_i - α_j τ_j*)/√(1-α_j²)) dν_i = 1,

where φ(·) is the density of the standard normal distribution. □

Corollary G.9. The optimal threshold satisfies

(1/α_j) (ν̄ - √(1-α_j²) Φ^{-1}(β_j M/(1 + β_j M))) ≤ τ_j* ≤ (1/α_j) (ν̄ - √(1-α_j²) Φ^{-1}(β_j/(1 + β_j))),

where M = sup h(ν_i).

Proof. Since h(ν_i) ≥ 1, we have

1 = ∫_{ν̄}^{+∞} (1 + β_j h(ν_i)) (1/√(1-α_j²)) φ((ν_i - α_j τ_j*)/√(1-α_j²)) dν_i ≥ (1 + β_j) (1 - Φ((ν̄ - α_j τ_j*)/√(1-α_j²))).

Rearranging yields the upper bound on τ_j*. Similarly, using h(ν_i) ≤ M, we can derive the lower bound. □

The proposition below summarizes the relation between the general case and the case where h(ν_i) = 1 for all ν_i.

Proposition G.10. Let τ_j* = τ_j(α_j, β_j; h(·)). Define

β̃_j = β̃_j(α_j, β_j; h(·)) = β_j ∫_{ν̄}^{+∞} h(ν_i) φ((ν_i - α_j τ_j*)/√(1-α_j²)) dν_i / ∫_{ν̄}^{+∞} φ((ν_i - α_j τ_j*)/√(1-α_j²)) dν_i.

Then we can use the new β̃_j to characterize the optimal threshold: τ_j(α_j, β_j; h(·)) = τ_j(α_j, β̃_j; h(·) = 1).

Proof. Let τ_j* = τ_j(α_j, β_j; h(·)) and τ_j** = τ_j(α_j, β̃_j; h(·) = 1).
Then a [ (1+ £;h()) 1 4" ~8jTF dv; -[~ (1+2°) | ¢ Yi-QjTz dv; =1. ¥ 1-a7 1a} ¥ ] 1-a% Substitute the expression of B; into the second equality and we have i | me Ja +00 " -a} 1 yj-ajT?' [ 1+ B; of 2 | at? 2 400, [ :-0577 1-2 ¢ i J | an j wel eb ve ae ls " . @ (2) dy; =1 a? j oe 6 "ery dy; {-QjT; Y 1-a% Tal = [a+ ashoyo ee Nav =1 a +00 Vi-QyT; - od Z | dy; 1 : mc _» [ treo vi-ayT?" d [ too Vj- AGT; d j= v; , oe an= f(a lay So we have T;" = Tj. oO A.21 Proposition G.11. For fixed B; and h(-), B; = Bi(a;, B;;h(-)) decreases with a. Proof. The optimal threshold 7 = 1;(a;, Bj; h(-)) is given by too 1 Vi QT} [ (1+;h(;)) ai Eom J By Proposition G.10, we can write ronan L(+ Bh) - ne ra) an +00 Vi-QyT; dy; +00 Vi-QyT} no ea) ono Vi bras eneing an fo ean 1-aF B; = Bj = = -1. +00 Vi-Qj +00 Vi-Qj d d oo ego nol ea Vj -Q; 7; Define x; = . Then dy; = ./1-- ar dx;. Using variable transformation, we have 1 a; 1-e? 1 Bi= 7 y= =1. +00 mali V-QjT; [v¢ (: i dy; 1-® J | v 1-a? J1-a? Vi - jT. j Denote Q(v;,a@;,8;) = . For fixed ;, the relationship between B; and a; reduces the relation- 1-a? J ship between Q(¥,a;, 8;) and a;. Using integration by parts for the formula of the optimal threshold, we have oD 1 gfe | a, [ Aly) XL ay, ; =? | ina | dy; = ; (1+ 8;h(%)) By; dy; V j l= [ " (1 + Bjh(%i)) +00 = (1+ Bj h(%i))® -[~ Bjh!(v,)® Fl wale = 1+8;M-(1+B;)®(Q(0,a;,;)) - B; [ h'(v,)P(Q(%, a;, By)) dvi, A.22 where M = sup h(v,). Take the derivative with respect to a, 0Q(¥,a@;, B;) 0 = -(1+8))(Q¥, 03, Bj) - 3 a Ovi, a;, B;) -B; [ h'(vi)6(Q(%,.@;, fy) POenkD ay, (G.13) Qj We want to show that a < 0 for all a; € (0,1). We prove this by contradiction. Assume i 0 y, jo Py 0 3A; ; that for some a € (0,1), we have IQ, 0,83) > 0. Since QV: 05,Bj ) al > 0, OG; laj=a' dajdvy; (1-3/2 OQ(V,a;, Bj) . we know that Fa; increases with v; for any fixed a; € (0,1), in particular for a; = a;. Then d0(V;,0;,B; 0 , 9Q(vin Bi) > 20.0.8) > 0 for any v; = ¥. 
Since $h'(\nu_i) \geq 0$, we have

$$\int_{\bar{\nu}}^{+\infty} h'(\nu_i)\, \phi(Q(\nu_i, \alpha_j', \beta_j)) \left. \frac{\partial Q(\nu_i, \alpha_j, \beta_j)}{\partial \alpha_j} \right|_{\alpha_j = \alpha_j'} d\nu_i \geq 0.$$

Then Equation (G.13) cannot hold for $\alpha_j = \alpha_j'$, as the right-hand side is strictly negative, a contradiction. So we must have $\frac{\partial Q(\bar{\nu}, \alpha_j, \beta_j)}{\partial \alpha_j} < 0$ for all $\alpha_j \in (0, 1)$. Therefore,

$$\frac{\partial \tilde{\beta}_j}{\partial \alpha_j} = \frac{\phi(Q(\bar{\nu}, \alpha_j, \beta_j))}{\left(1 - \Phi(Q(\bar{\nu}, \alpha_j, \beta_j))\right)^2} \cdot \frac{\partial Q(\bar{\nu}, \alpha_j, \beta_j)}{\partial \alpha_j} < 0. \qquad \square$$

G.2 Incorrect Beliefs

Under the model of radiologist signals implied by Equation (5), we can identify each radiologist's skill $\alpha_j$ and her diagnostic threshold $\tau_j$. The utility in Equation (6) implies the optimal threshold in Equation (7), as a function of skill $\alpha_j$ and preference $\beta_j$. If radiologists know their skill, then this allows us to infer $\beta_j$ from $\alpha_j$ and $\tau_j$. In this appendix, we allow for the possibility that radiologists may be misinformed about their skill: a radiologist may believe she has skill $\tilde{\alpha}_j$ even though her true skill is $\alpha_j$. Since only (true) $\alpha_j$ and $\tau_j$ are identified, we cannot separately identify $\tilde{\alpha}_j$ and $\beta_j$ from Equation (7). In this exercise, we therefore assume a common value of $\beta_j$ in order to infer $\tilde{\alpha}_j$ for each radiologist.

We start with our baseline model and form an empirical Bayes posterior mean of $(\alpha_j, \beta_j)$ for each radiologist. We use Equation (7) to impute the empirical Bayes posterior mean of $\tau_j$. Thus, for each radiologist, we have an empirical Bayes posterior mean of $(\alpha_j, \beta_j, \tau_j)$ from our baseline model; the distributions of the posterior means for $\alpha_j$, $\beta_j$, and $\tau_j$ are shown in separate panels of Appendix Figure A.13. To extend this analysis to impute each radiologist's belief about her skill, $\tilde{\alpha}_j$, we perform the following two additional steps. First, we take the mean of the distribution of empirical Bayes posterior means $\{\hat{\beta}_j\}$, which we calculate as 6.71. Second, we set all radiologists to have $\beta_j = 6.71$. We use each radiologist's empirical Bayes posterior mean of $\tau_j$ and the formula for the optimal threshold in Equation (7) to infer her belief about her skill, $\tilde{\alpha}_j$.
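Under the bivariate normal signal structure with $h(\cdot) \equiv 1$, the optimal threshold has a closed form, and the inversion from an observed $\tau_j$ back to perceived skill $\tilde{\alpha}_j$ at a fixed $\beta_j = 6.71$ can be sketched numerically. This is an illustrative sketch, not the paper's code: the value of $\bar{\nu}$ and the simplification $h(\cdot) \equiv 1$ are our assumptions, and `tau_star` and `perceived_skill` are hypothetical helper names.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def tau_star(alpha, beta, nu_bar):
    """Optimal diagnostic threshold under bivariate normal signals, h(.) = 1."""
    return (nu_bar - np.sqrt(1 - alpha**2) * norm.ppf(beta / (1 + beta))) / alpha

def perceived_skill(tau_obs, beta=6.71, nu_bar=1.5):
    """Invert tau_star to recover perceived skill alpha-tilde.

    tau_star(alpha) is first decreasing then increasing in alpha; following
    the appendix, we take the root on the upward-sloping branch. Returns
    None when tau_obs lies below the minimum of the curve, in which case
    no alpha rationalizes the threshold at this beta.
    """
    grid = np.linspace(0.01, 0.999, 2000)
    vals = tau_star(grid, beta, nu_bar)
    i_min = np.argmin(vals)
    if tau_obs < vals[i_min]:
        return None
    # root-find on the increasing branch [alpha at the minimum, ~1)
    return brentq(lambda a: tau_star(a, beta, nu_bar) - tau_obs,
                  grid[i_min], grid[-1])
```

Consistent with the text, the function returns `None` for thresholds too low to be rationalized and otherwise picks the higher of the two candidate values of $\tilde{\alpha}_j$.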
The relationship between $\tilde{\alpha}_j$, $\beta_j$, and $\tau_j$ is shown in Figure IX. As shown in the figure, for $\beta_j = 6.71$, $\tau_j$ is first decreasing and then increasing in a radiologist's perceived skill $\tilde{\alpha}_j$. Thus, holding fixed $\beta_j = 6.71$, an observed $\tau_j$ does not generally imply a single value of $\tilde{\alpha}_j$. If $\tau_j$ is too low, then there is no value of $\tilde{\alpha}_j$ that generates $\tau_j$ with $\beta_j = 6.71$; this case occurs only for a minority of radiologists. Other values of $\tau_j$ can generally be consistent with either a value of $\tilde{\alpha}_j$ on the downward-sloping part of the curve or a value of $\tilde{\alpha}_j$ on the upward-sloping part of the curve. In this case, we take the higher value of $\tilde{\alpha}_j$, since the vast majority of empirical Bayes posterior means of $\alpha_j$ are on the upward-sloping part of Figure IX.

Appendix Figure A.19 plots each radiologist's perceived skill, $\tilde{\alpha}_j$, on the y-axis and her actual skill, $\alpha_j$, on the x-axis. The plot shows that radiologists' perceptions of their skill generally correlate well with their actual skill, particularly among higher-skilled radiologists. Lower-skilled radiologists, however, tend to overestimate their skill relative to the truth.

G.3 Simulation of Linear Risk Adjustment

As described in Section 5.2, we estimate our structural model using moments for each radiologist that are risk-adjusted by linear regressions. An alternative approach would be to explicitly incorporate heterogeneity in $\Pr(s_i = 1)$, by station, time, and patient characteristics, into the structural model. While this approach is more consistent with the structural model, it is often computationally prohibitive. In this appendix section, we use Monte Carlo simulations to examine the effectiveness of linear risk adjustment in recovering the underlying structural parameters of our model.

Specifically, we fix the set of radiologists at each station and the number of patients that each radiologist examines, $n_j$, to match the actual data.
Assuming that the parameter estimates in Table I are the truth, we simulate primitives $\{\alpha_j, \beta_j\}_{j \in \mathcal{J}}$ independent of $n_j$. We also simulate at-risk patients from a binomial distribution with probability of being at risk $1 - \kappa$. For patients at risk, we simulate their latent index $\nu_i$ and the radiologist-observed signal $w_{ij}$ using the $\alpha_j$ of the assigned radiologist $j$. Importantly, in this simulation, we model conditional random assignment of patients to radiologists within station. For $\nu_i$ and $w_{ij}$ that are jointly normally distributed, as in Equation (5), we have $s_i = 1\left(\nu_i > \bar{\nu}_{\ell(j)}\right)$, where $\bar{\nu}_{\ell(j)}$ depends on the station $\ell(j)$ in which radiologist $j$ works. Radiologists know $\bar{\nu}_{\ell(j)}$. The optimal threshold is then

$$\tau^*(\alpha_j, \beta_j; \ell(j)) = \frac{\bar{\nu}_{\ell(j)} - \sqrt{1 - \alpha_j^2}\, \Phi^{-1}\left( \frac{\beta_j}{1 + \beta_j} \right)}{\alpha_j},$$

which generates $d_{ij} = 1\left(w_{ij} > \tau^*(\alpha_j, \beta_j; \ell(j))\right)$. We finally simulate patients who did not initially have pneumonia but later developed it with probability $\lambda$. Each simulated dataset has the same number of observations as the original dataset, with four variables for each patient $i$: the radiologist identifier $j$, the station identifier $\ell$, the diagnosis indicator $d_i = \sum_j 1(j = j(i))\, d_{ij}$, and the (observed) false negative indicator $m_i = 1(d_i = 0, s_i = 1)$.

We obtain risk-adjusted radiologist moments from the simulated data by regressing diagnosis or false negative indicators on radiologist dummies and station dummies. The key object governing confounding risk across groups of observations is the distribution of $\bar{\nu}_\ell$. We assume that this distribution is normal and calibrate its standard deviation to the following target: the ratio of the standard deviation of unadjusted radiologist diagnosis rates to the standard deviation of adjusted radiologist diagnosis rates. In the actual data, these standard deviations are shown in Appendix Table A.8 as 1.966 and 1.023, respectively. Conceptually, the ratio of these standard deviations captures the net effect of risk adjustment on reduced-form radiologist diagnosis rates.
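The data-generating process above can be sketched for a single station. This is a minimal sketch under the simplifying assumption $h(\cdot) \equiv 1$; the station threshold `nu_bar`, the function name, and the exact treatment of late-developing cases are our assumptions, with $\kappa$ and $\lambda$ set to the paper's baseline values.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate_station(n_patients, alphas, betas, nu_bar, kappa=0.336, lam=0.026):
    """Sketch of the Appendix G.3 data-generating process for one station.

    alphas, betas: arrays of primitives for the station's radiologists;
    nu_bar: the station-level threshold for pneumonia. Patients are
    randomly assigned to radiologists within the station.
    """
    J = len(alphas)
    j = rng.integers(0, J, n_patients)               # within-station random assignment
    at_risk = rng.random(n_patients) < (1 - kappa)   # at-risk indicator
    nu = rng.standard_normal(n_patients)             # latent index nu_i
    # radiologist signal w_ij with corr(w_ij, nu_i) = alpha_j
    w = alphas[j] * nu + np.sqrt(1 - alphas[j] ** 2) * rng.standard_normal(n_patients)
    s = at_risk & (nu > nu_bar)                      # pneumonia at the initial visit
    # optimal threshold with h(.) = 1
    tau = (nu_bar - np.sqrt(1 - alphas ** 2) * norm.ppf(betas / (1 + betas))) / alphas
    d = at_risk & (w > tau[j])                       # diagnosis indicator d_i
    # undiagnosed at-risk patients without pneumonia develop it later w.p. lam
    late = at_risk & ~s & ~d & (rng.random(n_patients) < lam)
    m = ~d & (s | late)                              # observed false negative m_i
    return j, d.astype(int), m.astype(int)
```

Collapsing `d` and `m` by radiologist (and, across stations, regressing on radiologist and station dummies) then yields the unadjusted and risk-adjusted moments used in the exercise.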
In each of five simulated datasets, we calculate a similar ratio. In our calibration, we aim to match the average of these ratios across the five simulations, holding the random-number seed fixed in each simulation.

In each of the simulations, we redo three sets of results based on unadjusted or adjusted radiologist moments. First, we re-estimate the model parameters. Second, we re-compute counterfactual variation in diagnoses and false negatives when either variation in skill or variation in preferences is eliminated, as described in Section 6.1. Third, we re-compute welfare under policy counterfactuals, as described in Section 6.2. As shown in Appendix Figure A.20, the results of this exercise suggest that linear risk adjustment eliminates most of the bias due to confounding variation in risk across groups of observations. For many estimated parameters and counterfactual results, the bias is almost entirely eliminated by linear risk adjustment.

G.4 Controlling for Radiologist Skill

Intuitively, monotonicity should hold within bins of skill. In this appendix section, we explore a Monte Carlo proof of concept for whether controlling for agent skill in a judges-design regression can recover complier-weighted treatment effects. Specifically, we simulate data that match our observed data, taking structural estimates as the truth. We then evaluate whether we can recover the complier-weighted "treatment effect," or $-\Pr(s_i = 1)$ in our case, that one should obtain under IV validity when regressing $m_i$ on $d_i$, instrumenting $d_i$ with $Z_i$.

As in Appendix G.3, we take the parameter estimates in Table I as the truth and simulate true primitives $\{\alpha_j, \beta_j\}_{j \in \mathcal{J}}$. We similarly fix observations per radiologist and simulate patients at risk. Among these patients, we simulate $\nu_i$ and $w_{ij}$. We determine which patients are diagnosed with pneumonia and which patients are false negatives based on $\tau_j^*(\alpha_j, \beta_j)$ in Equation (7) and $\bar{\nu}$.
This implies that, unlike the simulations in Appendix G.3, patients are unconditionally randomly assigned. Finally, we simulate patients who did not initially have pneumonia but later developed it with probability $\lambda$. In the remainder of this appendix section, we derive the target LATE and then examine whether we can estimate it using various strategies to control for skill.

Derivation of the Properly Specified Estimand. The ideal experiment would compare radiologists with the same $\alpha_j$. However, we have a continuous distribution of $\alpha_j$ and a finite number of radiologists. We therefore derive an approximation of the true relationship between $FN_j^{obs}$ and $P_j^{obs}$, conditional on skill $\alpha_j$, under a large number of radiologists with the same skill and a large number of patients per radiologist. We then integrate this approximation over the distribution of skill. Specifically,

$$P^{obs}(\alpha_j, \beta_j) = (1 - \kappa) \Pr\left(w_{ij} > \tau_j^*\right) = (1 - \kappa)\left(1 - \Phi(\tau_j^*)\right); \quad \text{(G.14)}$$
$$FN^{obs}(\alpha_j, \beta_j) = (1 - \kappa)\left( \Pr\left(w_{ij} < \tau_j^*, \nu_i > \bar{\nu}\right) + \lambda \Pr\left(w_{ij} < \tau_j^*, \nu_i < \bar{\nu}\right) \right), \quad \text{(G.15)}$$

where $\tau_j^* = \tau^*(\alpha_j, \beta_j)$ in Equation (7). Conditional on $\alpha_j$, there exists a one-to-one mapping in the reduced-form space between $FN_j^{obs}$ and $P_j^{obs}$.

Conditional on the realization of skill $\alpha$, we draw $J + 1$ radiologists with varying $\beta_j$ from the true distribution and derive their optimal thresholds $\tau_j^*$. We calculate their population diagnosis and miss rates as $p_j = E[d_i | j(i) = j] = P^{obs}(\alpha, \beta_j)$ and $m_j = E[m_i | j(i) = j] = FN^{obs}(\alpha, \beta_j)$, respectively. We consider the LATE when we use $p_j$ as the scalar instrument for diagnosis $d_i$. We rank radiologists by $p_j$ from smallest to largest, so that $p_0 < p_1 < \cdots < p_J$. From Theorem 2 of Imbens and Angrist (1994), the LATE conditional on skill $\alpha$ is

$$\Delta^*(\alpha) = \sum_{j=1}^{J} w_j\, \delta_{j,j-1},$$

where

$$w_j = \frac{(p_j - p_{j-1}) \sum_{l \geq j} \pi_l\, (p_l - \bar{p})}{\sum_{m=1}^{J} (p_m - p_{m-1}) \sum_{l \geq m} \pi_l\, (p_l - \bar{p})}, \qquad \delta_{j,j-1} = \frac{m_j - m_{j-1}}{p_j - p_{j-1}}.$$

Here $w_j$ is a non-negative weight, which depends on the first-stage differences in diagnosis rates between radiologists and the probability of assignment to $j$, $\pi_j$.
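The Imbens-Angrist weights and Wald estimands above can be computed directly once $p_j$, $m_j$, and $\pi_j$ are in hand. A sketch (function names are ours; inputs are assumed already sorted by $p_j$):

```python
import numpy as np

def late_weights(p, pi):
    """Imbens-Angrist (1994) weights for a multi-valued instrument.

    p:  diagnosis rates p_0 < p_1 < ... < p_J (sorted ascending)
    pi: assignment probabilities pi_0..pi_J summing to 1
    Returns the weights w_1..w_J, which are non-negative and sum to 1.
    """
    p, pi = np.asarray(p, float), np.asarray(pi, float)
    p_bar = np.sum(pi * p)
    a = pi * (p - p_bar)
    # tail sums: sum_{l >= j} pi_l (p_l - p_bar), for j = 1..J
    tail = np.cumsum(a[::-1])[::-1][1:]
    num = np.diff(p) * tail
    return num / num.sum()

def late(p, m, pi):
    """Complier-weighted LATE: sum_j w_j (m_j - m_{j-1}) / (p_j - p_{j-1})."""
    w = late_weights(p, pi)
    return float(np.sum(w * np.diff(m) / np.diff(p)))
```

Averaging `late(...)` over draws of $\alpha_k$ with $\pi_j = (J+1)^{-1}$ reproduces the construction of the unconditional LATE described next.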
$\delta_{j,j-1}$ is the Wald estimand based on random assignment between $j$ and $j-1$. Note that $\pi_j = (J+1)^{-1}$ for all $j$ by random assignment, and $\bar{p} = \frac{1}{J+1} \sum_{j=0}^{J} p_j$. We then simulate $K$ values of $\alpha_k$ from the true distribution to derive the LATE (unconditional on skill) as

$$\Delta^* = \frac{1}{K} \sum_{k=1}^{K} \Delta^*(\alpha_k).$$

We choose reasonably large $J = 1{,}000$ and $K = 1{,}000$. This can be seen as an approximation of the expectation of the LATE across many realizations of skill. We compute $\Delta^* = -0.154$.

Estimation Results. We then estimate the effect of diagnosis $d_i$ on the false negative indicator $m_i$ and present results in Appendix Table A.11. As in the main text, we estimate this effect by judges-design IV, exploiting the relationship between radiologist diagnosis and miss rates. The standard specification is shown in Column 1 of all panels. Specifically, we perform 2SLS of $m_i$ on $d_i$, instrumenting $d_i$ with the leave-out diagnosis propensity $Z_i$, given in Equation (4). Since cases are randomly assigned unconditionally in this simulation, we include no further controls. This estimate is significantly positive, at 0.096, despite the true negative LATE of $\Delta^* = -0.154$.

In Panel A, we show results of regressions that control for true skill, $\alpha_j$. In Column 2 of this panel, we control for $\alpha_j$ linearly in the 2SLS regression. In Columns 3-6, we divide $\alpha_j$ into 5, 10, 20, and 50 bins, respectively, and include indicators for bins of $\alpha_j$ as controls in the regression. The results in these columns encompass the true LATE.

In Panel B, we show results of similar regressions that replace functions of true skill $\alpha_j$ with corresponding functions of the empirical Bayes posterior mean of $\alpha_j$, or $\hat{\alpha}_j$. Specifically, in Column 2, we control for $\hat{\alpha}_j$ linearly; in Columns 3-6, we divide $\hat{\alpha}_j$ into 5, 10, 20, and 50 bins, respectively, and include indicators for bins of $\hat{\alpha}_j$ as controls in the regression.
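The Column 1 specification can be sketched as a leave-out instrument plus a just-identified 2SLS estimator (here implemented via Frisch-Waugh-Lovell residualization and a Wald ratio). Function names are ours; in the paper's actual data, the specification additionally includes time-station controls, which the `controls` argument stands in for.

```python
import numpy as np

def leave_out_propensity(j_idx, d):
    """Leave-out diagnosis propensity Z_i: the diagnosis rate of patient i's
    radiologist computed over that radiologist's other cases."""
    J = j_idx.max() + 1
    tot = np.bincount(j_idx, weights=d, minlength=J)  # diagnoses per radiologist
    n = np.bincount(j_idx, minlength=J)               # cases per radiologist
    return (tot[j_idx] - d) / (n[j_idx] - 1)

def iv_2sls(y, d, z, controls=None):
    """Just-identified 2SLS of y on d, instrumented by z, with optional controls."""
    n = len(y)
    cols = [np.ones(n)] + ([controls] if controls is not None else [])
    X = np.column_stack(cols)
    def resid(v):
        b, *_ = np.linalg.lstsq(X, v, rcond=None)
        return v - X @ b
    y_r, d_r, z_r = resid(np.asarray(y, float)), resid(np.asarray(d, float)), resid(np.asarray(z, float))
    return float((z_r @ y_r) / (z_r @ d_r))
```

Adding skill-bin indicators to `controls` corresponds to the Columns 3-6 specifications in Panels A and B.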
To account for the fact that $\hat{\alpha}_j$ is a generated regressor, we construct standard errors from 50 bootstrapped samples, drawing observations by radiologist with replacement and keeping the total number of radiologists fixed. These estimates are also strongly negative, but they are more negative than the true LATE. The confidence intervals are also substantially wider.

In Panel C, we show results from indirect least squares regressions of $m_i$ on empirical Bayes posteriors of $P_j$ and $\alpha_j$. In Column 2, we control for the posterior mean $\hat{\alpha}_j$ linearly; in Columns 3-6, we control for posterior probabilities that $\alpha_j$ resides in each of 5, 10, 20, and 50 bins, respectively. We construct standard errors by the same bootstrap procedure that we use for Panel B. The estimates of the LATE are negative and less biased than in Panel B. Nevertheless, they are still generally larger in magnitude than the true LATE.

These results suggest that we can recover the true LATE when we control for true skill. However, estimates are biased, albeit in the opposite direction in our simulation, when we use empirical Bayes posteriors of skill. In Appendix Figure A.21, we confirm that estimates from regressions that use empirical Bayes posteriors approach the true LATE for radiologists with a very large number of cases. Even so, the number of cases per radiologist is already high in our simulated sample. By construction, each radiologist has at least 100 cases, and we match the distribution of cases for each radiologist to the actual distribution, shown in Appendix Figure A.1. We leave further refinement of this approach in finite samples to future work.

References

Andrews, M. J., L. Gill, T. Schank, and R. Upward (2008): "High Wage Workers and Low Wage Firms: Negative Assortative Matching or Limited Mobility Bias?" Journal of the Royal Statistical Society: Series A (Statistics in Society), 171, 673-697.
Figure A.1: Distribution of Radiologists and Cases

[Four histogram panels: radiologists per station (mean 31, median 26; 104 stations); radiologists per station-month (mean 7, median 6; 16,766 station-month pairs); cases per radiologist (mean 1,458, median 507; 3,199 radiologists); and cases per radiologist-month (mean 37, median 25; 124,832 radiologist-month pairs).]

Note: This figure shows the distributions of radiologists across stations, of radiologists across station-months, of cases across radiologists, and of cases across radiologist-months. As shown in Appendix Table A.1, the minimum number of cases for a radiologist is 100, and the minimum number of cases for a radiologist-month pair is 5. In this figure, we truncate the number of cases per radiologist at 10,000; 57 radiologists, or 1.78% of the total, have more cases than this limit. We truncate the number of cases per radiologist-month at 200; 1,274 radiologist-months, or 1.02% of the total, have more cases than this limit.
Figure A.2: Covariate Balance (Miss Rate)

[Three coefficient-plot panels over the 66 patient covariates (demographics, comorbidities, prior utilization, vital signs, and requesting-physician characteristics). A: False Negative (All), joint F-statistic 194.22, p-value 0.000. B: Leave-Out (All), F-statistic 2.64, p-value 0.000. C: Leave-Out (Balanced on Age), F-statistic 1.28, p-value 0.055.]

Note: This figure shows coefficients and 95% confidence intervals from regressions of the false-negative indicator $m_i$ (left column) or the assigned radiologist's leave-out miss rate (middle and right columns) on covariates $X_i$, controlling for time-station interactions $T_i$. The 66 covariates are the variables listed in Appendix A.2, less the 11 variables that are indicators for missing values. The leave-out miss rate is calculated analogously to the leave-out diagnosis propensity $Z_i$. The left and middle panels use the full sample of stations. The right panel uses 44 stations with balance on age, defined in Section 4.2. The outcome variables are multiplied by 100. Continuous covariates are standardized so that they have standard deviations equal to 1. For readability, a few coefficients (and their standard errors) are divided by 10, as indicated by "/10" in the covariate labels. At the bottom of each panel, we report the F-statistic and p-value from the joint F-test of all covariates.

Figure A.3: Predicting Diagnosis and False Negatives (Stations with Balance on Age)

[Two coefficient-plot panels over the same 66 covariates. A: Diagnosis (Balanced on Age), joint F-statistic 333.29, p-value 0.000. B: False Negative (Balanced on Age), F-statistic 84.75, p-value 0.000.]

Note: This figure shows coefficients and 95% confidence intervals from regressions of diagnosis status $d_i$ (left column) or the false negative indicator $m_i$ (right column) on covariates $X_i$, controlling for time-station interactions $T_i$, in the sample of 44 stations with balance on age (defined in Section 4.2). This is analogous to the left-hand columns of Figure VI and Appendix Figure A.2, respectively, with the restricted sample of stations. The outcome variables are multiplied by 100. The 66 covariates are the variables listed in Appendix A.2, less the 11 variables that are indicators for missing values. Continuous covariates are standardized so that they have standard deviations equal to 1. For readability, a few coefficients (and their standard errors) are divided by 10, as indicated by "/10" in the covariate labels. At the bottom of each panel, we report the F-statistic and p-value from the joint F-test of all covariates.

Figure A.4: Randomization Inference

[Two histograms of station-level p-values. A: Diagnosis. B: False Negative.]

Note: This figure plots histograms of station-level p-values for quasi-random assignment computed using randomization inference. We first residualize predicted diagnosis and false negative indicators $\hat{d}_i$ and $\hat{m}_i$ by minimal controls $T_i$. We then create 100 samples in each of which we randomly reassign the residualized values to patients within each station. For each of these samples as well as the baseline sample, we regress the residualized values on radiologist dummies and calculate the case-weighted standard deviation of estimated radiologist fixed effects. We then define the p-value for each station to be the share of the 100 samples that yield a larger standard deviation than the baseline sample. In each panel, light gray bars represent station counts among the 60 stations that fail the test according to age; dark gray bars represent station counts out of the 44 stations that pass the test according to age.

Figure A.5: Variation in Radiologist Miss Rates Under Counterfactual Sorting

[Two panels plotting the standard deviation of radiologist fixed effects against the percent of randomized observations. A: Full Sample. B: Stations with Balance on Age.]

Note: This figure plots on the y-axis the standard deviation of radiologist fixed effects in resorted data where $\varsigma \in [0, 100]$ percent of patients are randomly assigned to radiologists. The dashed line indicates the standard deviation in the observed data. Panel A shows results for the full sample. Panel B shows results for the sample of 44 stations selected for balance on age, as defined in Section 4.2. To construct the figure, we first residualize $m_i$ by minimal controls $T_i$. We then create 101 samples. In each, we first reassign $\varsigma \in \{0, 1, \ldots, 100\}$ percent of cases randomly and the remaining cases perfectly sorted by $m_i$ to radiologists within the same station (holding the total number of cases for each radiologist constant). For each of these samples and the baseline sample, we regress the reassigned values on radiologist fixed effects and display the standard deviation of the estimated values. The shaded gray regions reflect 95% confidence intervals across 50 bootstrapped samples, drawn by radiologist blocks. The confidence interval corresponding to the dashed line in Panel A is $\varsigma \in [96, 99]$; in Panel B, it is $\varsigma \in [97, 100]$.

Figure A.6: Projecting Data on ROC Space Using Alternative Parameter Values

[Six panels plotting true positive rates against false positive rates. A: Upper Bound of $S$. B: Lower Bound of $S$. C: High $\lambda$. D: Low $\lambda$. E: High $\kappa$. F: Low $\kappa$.]

Note: This figure plots the true positive rate ($TPR_j$) and false positive rate ($FPR_j$) analogously to Figure V, under alternative values of prevalence ($S$), the share of X-rays not at risk for pneumonia ($\kappa$), and the share of cases in which pneumonia first manifests after the initial visit ($\lambda$). In Panels A and B, we consider upper and lower bounds for $S$, as defined in Section 4.1. In Panels C and D, we increase and decrease $\lambda$ by 50% relative to the baseline value $\lambda = 0.026$. In Panels E and F, we increase and decrease $\kappa$ by 50% relative to its baseline value $\kappa = 0.336$. Appendix C provides details on this projection.

Figure A.7: Diagnosis and Miss Rates, Fixed Effects Specification

[Two scatter plots of false negative rates against diagnosis rates. A: Full Sample (Coeff = 0.079 (0.011); N = 4,663,840; J = 3,199). B: Stations with Balance on Age (Coeff = 0.063 (0.015); N = 1,464,642; J = 1,094).]

Note: This figure plots the relationship between miss rates and diagnosis rates across radiologists, using radiologist dummies as instruments. Plots are analogous to Figure VI. The x-axis plots $P_j^{obs}$ and the y-axis plots $FN_j^{obs}$, defined in Section 4.3, both residualized by minimal controls of station-time interactions. Panel A shows results in the full sample of stations, and Panel B shows results in the subsample comprising 44 stations with balance on age, as defined in Section 4.2. Each panel reports the 2SLS estimate and standard error (in parentheses) for the corresponding IV regression, as well as the number of cases ($N$) and the number of radiologists ($J$). To account for clustering by radiologist, we test for first-stage joint significance by comparing an F-statistic of the radiologist dummies with F-statistics in 100 bootstrapped samples, drawn by a two-step procedure by radiologist and then by patient (both with replacement). The p-value for the joint significance is less than 0.01.

Figure A.8: First Stage

[Binned scatter plot of residualized diagnosis against the residualized leave-out diagnosis propensity. Coeff = 0.330 (0.018); N = 4,663,840; J = 3,199.]

Note: This figure shows a binned scatter plot illustrating the first-stage relationship corresponding to Panel A of Figure VI. The y-axis shows residuals from a regression of diagnosis $d_i$ on the covariates $X_i$ and minimal controls $T_i$. The x-axis shows residuals from a regression of the leave-out propensity instrument $Z_i$ on the same controls. The overall probability of diagnosis is added to residuals on the y-axis, and the average case-weighted $Z_i$ is added to residuals on the x-axis. We report the first-stage coefficient as well as the number of cases ($N$) and the number of radiologists ($J$). The standard error is clustered at the radiologist level and shown in parentheses.

Figure A.9: Radiologist-Level Variation

[Scatter plot of radiologist miss rates against diagnosis rates. Coeff = 0.079 (0.011); N = 4,663,840; J = 3,199.]

Note: This figure shows the relationship between radiologists' miss rates and diagnosis rates. We collapse the underlying data in Panel A of Figure VI to the radiologist level by taking the average. Each dot represents a radiologist, weighted by the number of cases. The coefficient and standard error are identical to those shown in Panel A of Figure VI. A radiologist at the case-weighted 90th percentile of miss rates has a miss rate 0.7 percentage points higher than that of a radiologist at the case-weighted 10th percentile. We calculate this by subtracting the case-weighted 10th percentile residual from the case-weighted 90th percentile residual in the underlying case-weighted regression.

Figure A.10: Distribution of Slope Estimates Across Stations

[Histogram of station-level slope estimates.]

Note: This figure shows the distribution of station-level estimates of the slope relating radiologists' miss rates to their diagnosis rates. Each estimate is computed using the IV procedure analogous to that used to produce Figure VI, with data from a single station. In the figure, 73 out of 104 stations have an estimate of the coefficient greater than zero.

Figure A.11: Area Under the Curve (AUC) and Skill ($\alpha$)

[Curve mapping skill $\alpha \in [0, 1]$ to AUC $\in [0.5, 1]$.]

Note: The Area Under the Curve (AUC) is the integral of an ROC curve. This figure shows the one-to-one mapping between AUC and the measure of skill $\alpha$ under the assumptions of our structural model. When $\alpha = 0$, the ROC curve coincides with the 45-degree line and AUC = 0.5. When $\alpha = 1$, the ROC curve reduces to the left and top edges of ROC space and AUC = 1.
A.39 'JSIZOOIPeI Yovd IO] eI SST pUL SISOUSeIp oy) aye[NoTeo om 'ATTeUL "y AlTIqeqord TIM JISTA [ETM oy] Joye ssoyrueU jsIy eIUOUINOUd YOTYM UT soseo UBTSse om "eTUOUUNOUd 9ARY JOU Op pUe 'pasouselp jou "ysu 3e ore OY syuoyed Jo,j -/ proysamy) oysOUseIp s,JsIZO[OTpel ou] pue eTUOUMOUd Joy A P[OYseIy} oy} USATS 'UOIsIOSp sIsouseTp $,JSISOTOIPel sy) pue snjels eruOUNUd Jey) sUTMAIEp 0} {mM pue 44 TOM) SuNelNUNs UsU) 'y - | YSII Ie Buteg Jo AypIqeqold oy] WIM UOTNGLSIp [eruoUIG & WON eluounoud Jo Ysi 3 st Jane oy) JAYIOYM JOJ JOWOIPUT UL SUIMEID ISI 'eIep dy) UL ISIZOTOIpeI oy] 01 paUdIsse JoquINU st) 0} [enbe syuaned aye~nus vay am "f/f pue fo 4st3oporper Yor Joy SOA MID ISIY OM 'MOI PUODES DY} UT SJUSUIOU PoyeNUIIs Je SALUTE OT, '(MOI pUOdas 91]}) SoJBUNT]S [POUL UTeU Ino Woy ssaqTd poye[nuis pue sisjowesed poyeurse oy) BuIsN poye[NWIs s}USWOUT Oy} YIM (MOI ISIG OY}) BYEP OY) UL PoAJosgo syuSWOUT [eNjOe oy} soreduUIOD INT sty], -aonN ayes sisoubeiq ayes sisoubeiq aye SSI Sl0 OL'0 S00 00°0 80° 90°0 v0'0 c0° - St°0 OL'O g0°0 6¥0'0 = = 4909 cme 00°0 ne : Bee 200 . ' : : T n a : - + . " = oO oO . . a a ar, af roo 2 7 S 5 : olen 2 3 3 Jot 3 - ¢ 2 . . . 90°0 , 80°0 s]JUsWOY| peyejnwis :g aye sisoubeiq ayel ssi ayes sisoubeiq SL'0 0-0 s0'0 00°0 . 90°0 v0°0 20" . 010 S00 00" qa 0L0°0-= "90D. Bo OOO - ~~~ ego [Oo - ss Se 20°0 <0 mn n Ot GREG ee = o ® e a2 2 wet : . roo & > & 2 8 oe? : ate 2 5 5 PFN ot ® "3 "3 Sy, 90°0 80'0 S]JUBWO|| PEAJ@SAO "VY WA [POW :ZT'V amMNsLy A.40 Figure A.13: Distributions of Radiologist Posterior Means 400 300 300 ~ _> & 200 2 ® $ 200 oO oO © ® ~ 100 ~ 400 i 0 - 0:6 : 15 Qa T 500 10 400 > © 300 8 s a 3 200 rm 6 100 0 - 4 Correlation = -0.31 4 6 8 10 06 OO" 08 09 1.0 B a Note: This figure plots the distributions of radiologist empirical Bayes posterior means of our main spec- ification. The first three subfigures plot the distributions of skill @;, diagnostic thresholds 7* (a, i); and preferences Bj. 
The last subfigure plots the joint distribution of skill and preferences. The method to calculate empirical Bayes posterior means is described in Appendix E.3. A4l1 Figure A.14: ROC Curve with Model-Generated Moments 1.00 5 0.75 4 0.50 4 True positive rate 0.25 4 0.00 4 0.00 0.25 0.50 0.75 1.00 False positive rate Note: This figure presents, for each radiologist, the true positive rate (TPR;) and false positive rate (F PR;) implied by radiologist posterior means of our main structural specification. Radiologist posterior means 7; = (4; bi) are calculated after estimating the model, described in Appendix E.3, and are the same as shown in Appendix Figure A.13. Large-sample P; and FN; are functions of radiologist primitives, given by py; (v;) = Pr (wis > ily;) and po; (y;) = Pr (wis <T},¥i > 7| vi); given in Section 5. As in Figure V, TPR; = 1-FN,/S and FPR; = (P; + FN; -S) /(1-S). This figure also plots the iso-preference curves for B € {5,7,9} from (0,0) to (0,1) in ROC space. Each iso-preference curve illustrates how the optimal point in ROC space varies with skill for a fixed preference. A.42 Figure A.15: Heterogeneity in Skill A: Age B: Chest X-rays Focus 45 on > g 1 e * 35 o g 5 @ So z = 25 oO 2 a ect Coeff = 39.0 (1.9) Coeff = 0.257 (0.077) 407 __-- N= 11,876 15 N =3,199 7 8 2 a 8 Skill (x) Skill (cx) C: Log Median Time D: Log Median Report Length 7 4.2 8 5 > © fo wrt e---------- ee £ o - 6 ~ 39 - ° . 2 8 ian th S oa e °: ° .e oD °° . 25 < 36 c SO | WAAL & BS fT Re o OD foo S = Coett = -0.491 (0.204) Coeff = 4.42 (1.13 oeff = -0. : 4 N 2 O98 3.3L N= 3,128 7 8 9 ZT 8 2 Skill (a) Skill (cx) E: Medical School Rank F: Gender 400 9 . gL ye | wee eee 2 $3007 0 anne, s & © 8 8 . 8 5 200 . 7 . o a . a 3 . E 2 oe ee 57 8 100 °. eee © S ° 5 . ee Coeff = =269 (730) B Coeff = 0.531 (0.196) a oe}nt = VU. . 
of N= 1,697 6 N = 2,604 7 8 9 7 8 9 Skill (cx) Skill (a) Note: This figure shows the relationship between the empirical Bayes posterior mean of a radiologist's skill (@) on the x-axis and the following variables on the y-axis: (i) the radiologist's age; (ii) the proportion of the radiologist's exams that are chest X-rays; (iii) the log median time that the radiologist spends to generate a chest X-ray report; (iv) the log median length of the issue reports; (v) the rank of the medical school that the radiologist attended according to U.S. News & World Report; and (vi) gender. Except for gender, the three lines show the fitted values from the 25th, 50th, and 75th quantile regressions. For gender, the line shows the fitted values from an OLS regression. The dots are the median values of the variables on the y-axis within 30 bins of a. Appendix Figure A.16 shows the corresponding plots with preferences (8) on the x-axis. Some variables are missing for a subset of radiologists. For age, the result is based on a model that allows underlying primitives to vary by radiologist and age bin (we group five years as an age bin). See Section 5.5 for more details. Each panel reports the slope as well as the number of observations (N). The standard error is shown in parentheses. A.43 Figure A.16: Heterogeneity in Preferences A: Age B: Chest X-rays Focus 70 45 a ee on ---LL wrest > teenies ete ee . > © .% . 1 ge tea x 8 7 2 . tags tee . B eee S . .- 6 ._ * a Z@ 50; ~~~. * ey = 25 --f, e __ Te o Tots ea & TTT anna Coeff = -1.789 (0.278) Coeff = -0.012 (0.008) 40 N = 11,876 15 N = 3,199 7 8 9 10 6 7 8 Preference (B) Preference (f) C: Log Median Time D: Log Median Report Length N - iy S Pf eee eee ------0 77 £ 5S |------ - 6 ~ 39 = = . S oD ° ee e Da 3 e 2 5 < 36 e S a & Sf ne om ® | Loup eeeeesort7 ® S| ___-------- = Coeff = -0.389 (0.108) Coeff = 0.030 (0.019) 4 N =3,199 3.3 , N= 3,126 6 7 8 6 7 8 Preference () Preference (8) E: Medical School Rank F: Gender 400 9 gL x 2 . 
Note: This figure shows the relationship between a radiologist's empirical Bayes posterior mean of her preference (β) on the x-axis and the following variables on the y-axis: (i) the radiologist's age; (ii) the proportion of the radiologist's exams that are chest X-rays; (iii) the log median time that the radiologist spends to generate a chest X-ray report; (iv) the log median length of these reports; (v) the rank of the medical school that the radiologist attended, according to U.S. News & World Report; and (vi) gender. Except for gender, the three lines show the fitted values from the 25th, 50th, and 75th quantile regressions. For gender, the line shows the fitted values from an OLS regression. The dots are the median values of the variables on the y-axis within 30 bins of β. Figure A.15 shows the corresponding plots with diagnostic skill (α) on the x-axis. Some variables are missing for a subset of radiologists. For age, the result is based on a model that allows underlying primitives to vary by radiologist and age bin (we group five years as an age bin). See Section 5.5 for more details. Each panel reports the slope as well as the number of observations (N). The standard error is shown in parentheses.

Note: This figure illustrates our method of calculating the variation in diagnosis and miss rates due to variation in skill and preferences. For x ∈ [0, 1], we first keep β_j unchanged and replace α_j by (1 − x)·α_j + x·ᾱ, where ᾱ is the median value of α_j. When x = 0, this step simply gives α_j. When x = 1, this step replaces all α_j with ᾱ and thus eliminates all variation in α_j.
We derive the new diagnosis and miss rates under each x, calculate their standard deviations, and divide them by the original standard deviation at x = 0. We perform a similar calculation by shrinking β_j toward the median value β̄ as x approaches 1, keeping α_j unchanged. Panel A shows the effect of reducing variation in skill or variation in preferences on the variation in diagnosis rates. Panel B shows the effect on the variation in miss rates. We report numbers that correspond to x = 1 in Section 6.1.

Figure A.17: Variation Decomposition

[Figure: two panels — A: Diagnosis Rate; B: Miss Rate. Each plots the percentage of diagnosis (or miss) rate variation remaining (y-axis) against the percentage reduction in primitive variation (x-axis, 25 to 100), with separate lines for skill and preference.]

Figure A.18: Counterfactual Policies

[Figure: welfare gain (y-axis, −0.025 to 0.100) against the social planner's preference (β*) on the x-axis, with three lines: a fixed threshold; a fixed threshold computed as if skill were homogeneous; and improving skill to the 25th percentile.]

Note: This figure plots the counterfactual welfare gains of different policies. Welfare is defined in Equation (10) and is normalized to 0 for the status quo and 1 for the first best (no false positive or false negative outcomes). The x-axis represents different possible disutility weights that the social planner may place on false negatives relative to false positives, or β*. The first policy imposes a common diagnostic threshold to maximize welfare. The second policy also imposes a common diagnostic threshold to maximize welfare but incorrectly computes welfare under the assumption that radiologists have the same diagnostic skill. The third policy trains radiologists to the 25th percentile of diagnostic skill (if their skill is below the 25th percentile) and allows them to choose their own diagnostic thresholds based on their preferences.
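In code, the interpolation step described in the note to Figure A.17 amounts to a linear shrink of each primitive toward its median. Below is a minimal sketch with made-up skill values; in the actual exercise, diagnosis and miss rates are recomputed through the structural model at each x, so the decline in rate variation need not be linear the way it is for the primitives themselves.

```python
from statistics import median, pstdev

def variation_remaining(alphas, x):
    """Shrink each alpha_j toward the median by fraction x and report
    the share of the original standard deviation that remains (in %)."""
    a_bar = median(alphas)
    shrunk = [(1 - x) * a + x * a_bar for a in alphas]
    return 100 * pstdev(shrunk) / pstdev(alphas)

# Made-up skill draws for illustration only.
alphas = [0.6, 0.8, 0.9, 1.0, 1.1, 1.3]
print([round(variation_remaining(alphas, x), 1) for x in (0.0, 0.5, 1.0)])
# → [100.0, 50.0, 0.0]
```

Because the shrink is linear in the primitive, the primitive's remaining variation is exactly (1 − x)·100 percent; the curves in Figure A.17 differ from this straight line only because the rates are nonlinear functions of the primitives.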
Figure A.19: Possibly Incorrect Beliefs about Accuracy

[Figure: scatter plot of perceived accuracy (y-axis, 0.6 to 1.0) against true accuracy (x-axis, 0.6 to 1.0), one point per radiologist.]

Note: This figure plots the relationship between radiologists' true accuracy and perceived accuracy in an alternative model in which variation in diagnostic thresholds for a given skill is driven by variation in perceived skill, holding preferences fixed. This contrasts with the baseline model, in which radiologists perceive their true skill but may vary in their preferences. We calculate the mean preference from our benchmark estimation results as β = 6.71, and we assign this preference parameter to all radiologists. We then use the formula for the optimal threshold as a function of β = 6.71 and (perceived) accuracy to calculate perceived accuracy. Appendix G.2 describes this procedure to calculate perceived accuracy in further detail.

Figure A.20: Comparing Results with and without Risk Adjustment

[Figure: three panels — A: Model Parameter Estimates; B: Variance Decomposition (diagnosis and false negative rates, decomposed by skill and by preferences); C: Welfare Under Counterfactual Policies (C1 through C6).]

Note: This figure shows structural results from simulated data with heterogeneity in pneumonia risk across stations. We simulate data to match the actual data in the number of radiologists in each station and the number of patients assigned to each radiologist. The simulated data come from the data generating process described in Appendix G.3, which matches the baseline model in Section 5.1 but allows for heterogeneity in pneumonia risk across stations.
We take model parameter estimates in Table I as the truth and additionally include station-specific thresholds to model heterogeneity in pneumonia risk across stations. In each simulated dataset, we re-estimate structural parameters using radiologist diagnosis and miss rates that are either unadjusted (shown in triangles) or adjusted by linear regressions controlling for station dummies (shown in circles). Panel A shows model parameter estimates, as defined in Table I. Panel B shows variance decomposition results that follow from the model parameter estimates, as described in Section 6.1. Panel C similarly shows welfare under counterfactual policies, as described in Section 6.2. Horizontal lines denote true values of each object.

Figure A.21: Slope Estimates with Skill Controls, Radiologists Ordered by Volume

[Figure: estimated slope (y-axis) against the number of top radiologists included in the sample (x-axis, 0 to 3,200), with a shaded 95% confidence interval and a dashed line at the true estimand.]

Note: This figure shows 2SLS estimates in simulated data of λ* in subsamples of radiologists ordered by volume. λ* is the LATE of diagnosis d_i on false negative m_i (i.e., −Pr(s_i)), which we should obtain in valid judges-design (IV) regressions examining the relationship between radiologist diagnosis and miss rates. We regress m_i on d_i, instrument d_i with the leave-out diagnosis propensity Z_j in Equation (4), and control for the empirical Bayes posterior mean of radiologist skill. Each estimate is based on a subsample of radiologists included in order of volume (from highest to lowest volume). The far-right end of the x-axis shows the estimate from the full sample; that estimate corresponds to Column 2 of Panel B in Appendix Table A.11. The 95% confidence interval is shaded in gray; standard errors are clustered by radiologist. The true estimand, λ* = −0.154, is shown as the dashed line. Appendix G.4 provides further details.
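For reference, a leave-out diagnosis propensity of this kind can be computed in a single pass over the data. The sketch below uses toy data, not the VA sample, and assumes only the standard leave-out construction: Z for case i is the mean diagnosis over the assigned radiologist's other cases.

```python
from collections import defaultdict

def leave_out_propensity(cases):
    """cases: list of (radiologist_id, diagnosis in {0, 1}).
    Returns Z for each case: the mean diagnosis over the same
    radiologist's other cases (None if there are no other cases)."""
    totals = defaultdict(lambda: [0, 0])  # rad_id -> [sum of d, n]
    for rad, d in cases:
        totals[rad][0] += d
        totals[rad][1] += 1
    z = []
    for rad, d in cases:
        s, n = totals[rad]
        z.append((s - d) / (n - 1) if n > 1 else None)
    return z

cases = [("A", 1), ("A", 0), ("A", 0), ("B", 1), ("B", 1)]
print(leave_out_propensity(cases))  # → [0.0, 0.5, 0.5, 1.0, 1.0]
```

Excluding the own case removes the mechanical correlation between a case's own outcome and its radiologist's measured propensity, which is why the leave-out version is used as the instrument.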
A.49 'days Yous Joye ssIZo[orpes Jo Jaquinu ay) pue 'seseo Jo Joquimu oy} 'sdoys UoNe[es ofdures Aoy soqhiosep [qui st], azony Sosvo SUIUICUIOI ODT 661°E Ors"E99'r Ue) Jomo; YIM sisTsoporpes dos *L yuoujsnf[pe-ystI Jo suoteoyroeds uotssoiser Ino [je Ul ? 7 Jo wed se suonovsioyut JeoXA-YUOUL OpNpOUL 9M BOUTS '(ROOT 'Te 19 SMOIPUY) SUOTIPAIASQO ¢ UPY} JOM LLOS 97S 'CLP seig Aypigow poywuy] ysurese soyesyrw siyy, = YT sured yUoW-js1Soyorpes doi "9 OC UeY} SSoJ JO OOT €87'9 L8L'LIS'+ Uey) JoyeoIs o3e YIM sjuoned doiq '*¢ Jopues JO o8v JUST}ed Io AJNUSPT IsISOTOIpeI €87'9 S86'€78'P SUISSTUN YIM SUOTeAIOSGO dol "7 sep O¢ UIYIIM SkbI-X 3S9Y9 JoLd OU YIM SABI-K S49 [ENIUI UO snd0j OM "(S}ISTA UINJAI "3"9) Kel-X jSoyp ISP] OY} Woy skep OF €87'9 OSS'8Z38'P souloojno JuoNbasans Ul poysoIOJUT OIe OM DOUIS --s- 4 SBOT Je Ole Jey) sAep-jUoNed urlejoy "¢ Aep-yaoryed oy) ut ABJ-X JSOYO ISI OY 0} SuIpucdsad0d IsTsoorpel ay} 0) Aep-juoned au} USIsse om 'sABI-X ISO woTNeAJasgo au0 OUT Aep-juonjed PEED Iv3'L7r's oY} SUOUIE s}sTSO[OIpeI o[dy[NUI ore sJay) JT ~-s-B «UT. 
Table A.2: Patient and Order Characteristic Variables

Demographics (13 variables): Age, indicator for male gender, indicator for married, 2 indicators for religion (Roman Catholic, Baptist, other religion as omitted), 4 indicators for race* (Black, White, American Indian, Pacific Islander, Asian/other race as omitted), indicator for veteran, distance between home and VA station performing X-ray*.

Prior utilization (3 variables): Previous-year outpatient visits, previous-year inpatient visits, previous-year ED visits.

Prior diagnoses (32 variables): 31 Elixhauser indicators (dividing the hypertension indicator into 2 indicators for complicated and uncomplicated hypertension), indicator for prior pneumonia.

Vital signs and WBC count (21 variables): Systolic blood pressure*, diastolic blood pressure*, pulse*, pain*, O2 saturation*, respiratory rate*, temperature*, indicator for fever, indicator for supplemental O2 provided*, flow rate of supplemental O2, concentration of supplemental O2, white blood cell (WBC) count*.

X-ray order (8 variables): Indicator for urgent order, indicator for X-ray with multiple views (CPT 71020), number of X-rays by requesting physician, indicator for above-median average predicted diagnosis (based on the 13 demographic variables) of requesting physician, indicator for above-median average predicted false negative (based on the 13 demographic variables) of requesting physician, requesting physician leave-out share of pneumonia diagnoses, requesting physician leave-out share of false negatives, requesting physician leave-out share of urgent orders.
Note: This table describes the 77 patient and X-ray order characteristic variables used as controls. An asterisk (*) after a variable denotes that we include an additional variable to indicate missing values; there are 11 such variables. Predicted diagnosis and predicted false negative are predicted probabilities formed by running linear probability regressions of the diagnosis indicator d_i and the false negative indicator m_i, respectively, on demographic variables to calculate a linear fit for each patient. These predicted probabilities are averaged within each requesting physician.

Table A.3: Covariate Balance

Panel A: Diagnosis and Leave-Out Diagnosis Propensity

                                      All Stations                        Stations with Balance on Age
                          d1     d2      Diagnosis   Leave-Out Diagnosis    d2      Leave-Out Diagnosis
                                                     Propensity                     Propensity
Demographics              13   3,198     458.62       4.63                1,093      0.91
                                         [0.000]      [0.000]                        [0.538]
Prior diagnosis           32   3,198     550.12       3.60                1,093      1.44
                                         [0.000]      [0.000]                        [0.055]
Prior utilization          3   3,198     833.74      11.00                1,093      1.79
                                         [0.000]      [0.000]                        [0.147]
Vitals and WBC count      21   3,198    1341.36       4.01                1,093      1.00
                                         [0.000]      [0.000]                        [0.463]
Ordering characteristics   8   3,198     238.20       7.61                1,093      4.32
                                         [0.000]      [0.000]                        [0.000]
All variables             77   3,198     608.20       2.28                1,093      1.40
                                         [0.000]      [0.000]                        [0.015]

Panel B: False Negative and Leave-Out Miss Rate

                          d1     d2      False        Leave-Out             d2      Leave-Out
                                         Negative     Miss Rate                     Miss Rate
Demographics              13   3,198     456.37       4.43                1,093      1.98
                                         [0.000]      [0.000]                        [0.019]
Prior diagnosis           32   3,198     318.08       2.84                1,093      1.45
                                         [0.000]      [0.000]                        [0.053]
Prior utilization          3   3,198    1044.72       9.57                1,093      0.25
                                         [0.000]      [0.000]                        [0.863]
Vitals and WBC count      21   3,198     516.95       4.21                1,093      1.23
                                         [0.000]      [0.000]                        [0.213]
Ordering characteristics   8   3,198     304.37      11.26                1,093      2.32
                                         [0.000]      [0.000]                        [0.018]
All variables             77   3,198     194.22       2.64                1,093      1.28
                                         [0.000]      [0.000]                        [0.055]

Note: This table presents results of joint statistical significance from regressions of different outcomes on groups of patient characteristics.
Each cell presents the F-statistic of the joint significance of a group of patient characteristics in a regression of an outcome, controlling for minimal controls T_i. Panel A mirrors Figure IV, where Column 1 uses the diagnosis indicator as the outcome and Columns 2-3 use the assigned radiologist's leave-out diagnosis propensity. Panel B mirrors Appendix Figure A.2, where Column 1 uses the false negative indicator as the outcome and Columns 2-3 use the assigned radiologist's leave-out miss rate. In both panels, Columns 1 and 2 show regressions using the full sample of stations with 4,663,840 observations, and Column 3 shows regressions using the sample of 44 stations with balance on age, with 1,464,642 observations, described in Section 4.2. d1, the first degree of freedom of the F-statistic, corresponds to the number of covariates; d2, the second degree of freedom, corresponds to the number of radiologists minus 1. The p-value corresponding to each F-statistic is displayed in brackets. Patient characteristics are described in further detail in Section 3 and Appendix Table A.2. Figure IV shows estimated coefficients and 95% confidence intervals for regressions with "all variables" in Panel A; Appendix Figure A.2 shows estimated coefficients and 95% confidence intervals for regressions with "all variables" in Panel B.
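The F-statistics above compare a restricted regression (without the covariate group) to an unrestricted one. The following is a minimal sketch of that comparison with a single made-up covariate and an intercept-only restricted model; the table's actual regressions additionally include the minimal controls T_i, multi-variable covariate groups, and clustered inference, all omitted here.

```python
def ols_ssr(y, x):
    """Sum of squared residuals from a simple OLS of y on x with an intercept."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def f_stat(y, x):
    """F-test that the covariate's coefficient is zero:
    F = ((SSR_restricted - SSR_unrestricted) / q) / (SSR_unrestricted / (n - k)),
    with q = 1 restriction and k = 2 estimated parameters."""
    n = len(y)
    my = sum(y) / n
    ssr_r = sum((yi - my) ** 2 for yi in y)  # restricted: intercept only
    ssr_u = ols_ssr(y, x)
    return (ssr_r - ssr_u) / (ssr_u / (n - 2))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]  # made-up data, roughly y = 2x
print(round(f_stat(y, x), 1))
```

A strongly predictive covariate group yields a large F, as in Column 1 of the table; under quasi-random assignment, the same group should have little predictive power for the assigned radiologist's leave-out propensity, as in Column 3.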
Table A.4: Balance

[Table: mean risk-adjusted outcomes for patients assigned to radiologists with below- vs. above-median diagnosis rates (Columns 1-3) and below- vs. above-median miss rates (Columns 4-6), with differences in Columns 3 and 6. Rows within each panel: Diagnosis, Predicted diagnosis, False negative, Predicted false negative, and Number of cases. Panel A: Full Sample; Panel B: Stations with Balance on Age.]

Note: This table presents results assessing balance in patient characteristics. We divide patients into two groups with above- and below-median values of their assigned radiologist's risk-adjusted diagnosis rates P_j^obs (Columns 1-3) or miss rates FN_j^obs (Columns 4-6), further risk-adjusted by minimal controls T_i. In each panel, the patient groups are compared by actual diagnosis d_i, predicted diagnosis, actual false negative m_i, and predicted false negative. Predicted diagnosis and predicted false negative are formed by regressions using 77 patient characteristic variables, described in further detail in Section 3 and Appendix Table A.2. These outcomes are risk-adjusted by T_i. Columns 1-2 and 4-5 show the mean of each residualized outcome across patients in each group; differences between groups are given in Columns 3 and 6. Standard errors shown in parentheses are computed by regressing the outcome on an above-median indicator and a below-median indicator, without a constant, and clustering by radiologist. Panel A shows results in all stations; Panel B shows results in stations with balance on age, described further in Section 4.2. In the last row of each panel, we display the number of cases in each group.
Table A.5: Statistics on Radiologist-Level Moments

                                                        Percentiles
                                  Mean     SD      10th    25th    75th    90th
Panel A: Observed, Risk-Adjusted
Diagnosis rate P_j^obs            0.070   0.010   0.059   0.065   0.074   0.082
Miss rate FN_j^obs                0.022   0.005   0.017   0.019   0.024   0.027
Panel B: Also Adjusted for κ = 0.336 and λ = 0.026
Diagnosis rate P_j                0.105   0.015   0.089   0.097   0.112   0.123
Miss rate FN_j                    0.010   0.007   0.002   0.006   0.013   0.018
False positive rate FPR_j         0.068   0.019   0.048   0.057   0.078   0.090
True positive rate TPR_j          0.802   0.131   0.654   0.748   0.878   0.959

Note: This table presents statistics for various radiologist-level moments. Panel A shows raw risk-adjusted diagnosis and miss rates, which are fitted radiologist fixed effects from regressions of d_i and m_i on radiologist fixed effects, patient characteristics X_i, and minimal controls T_i, respectively. Panel B adjusts for the share of X-rays not at risk of pneumonia (κ = 0.336), calibrated in Section 3, and the share of cases whose pneumonia manifests after the first visit (λ = 0.026), estimated in Section 5.2. False positive rates and true positive rates are then computed using the estimated prevalence rate (S = 0.051). All statistics are weighted using the number of cases. See Appendix C for more details.
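Numerically, the Panel B adjustments can be sketched as follows. This assumes the diagnosis-rate adjustment simply divides the observed rate by the at-risk share (1 − κ) — an assumption, though one consistent with the reported means — and takes the adjusted miss rate as given, applying the TPR/FPR identities quoted in the note to Appendix Figure A.14.

```python
KAPPA = 0.336  # share of X-rays not at risk of pneumonia
S = 0.051      # estimated prevalence rate among at-risk cases

def adjust_diagnosis_rate(p_obs, kappa=KAPPA):
    """Rescale the observed diagnosis rate to the at-risk population
    (assumed adjustment: divide by 1 - kappa)."""
    return p_obs / (1 - kappa)

def tpr_fpr(p, fn, s=S):
    """TPR_j = 1 - FN_j / S; FPR_j = (P_j + FN_j - S) / (1 - S)."""
    return 1 - fn / s, (p + fn - s) / (1 - s)

p = adjust_diagnosis_rate(0.070)   # ~0.105, matching the Panel B mean
tpr, fpr = tpr_fpr(p, fn=0.010)
print(round(p, 3), round(tpr, 3), round(fpr, 3))  # → 0.105 0.804 0.068
```

The outputs line up with the Panel B means (0.105, 0.802, 0.068) up to rounding, since the table's statistics are case-weighted across radiologists rather than computed at the mean.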
A.54 "S]IeJOp aIOUI JO} CQ xIpueddy seg 'juoUMNsUT o[duIes-osIoAaI B BuUIsN '(¢°q) UONeNby Wo s}[Nser sMOYs g [OUR 'JUSWINI|SUI jNO-oAeg] pepuRys B BUISN '(p'q) uonenby Woy s}fnser smoUs VY [oUk 'SOTMMUINp UOTIEdOT YIM Po}OBIOJUL SOTUUINp oun) pue '7'W 91qRI, xIpueddy puke ¢ UONDeg UI pequosap 'sonsLiajovseyo yuoned JO] So|QUIRA {/, JOJ SUTTONS 'sIsOUSEIP UO JUSUIMASUI INO-dAv] B JO JO9IFa OY) JO SUOTSSAIZoI O8e}s ISIY UNI OM 'o|duresqns Yors Ul "sUOTAIasqo Jo o[duresqns JUSIOFIP & 0} spuodsalos UUINJOO YORY 'oIMJeIo] USIsop-sospnf 34) UT pJepue)s are yey) AJOTUOJOUOUT JO $}S9} [PUIOJUI Wo] s}[Nsel sMOYs 91QR) STL "asoNy Sox SOX Sox sox Sox Sok SOx SOX sjouos Juoneg SOX SOX Sox Sox SOK SOK Sox SOX sjoo]Jo poxy UoTels x SUIT], 86r'007T é69S'IZE'E CHL'OLS'T 6r9°OV0'E 906TEEZT Y68TEET O9STEET TOB'IEET SUOTIPAIOSQO) €L0°0 690'0 6S0'0 ¢L0'0 6110 Tz0°0 680°0 T¢0°0 SUIODINO UBSIAL (610'0) (800°0) (v10'0) (010'0) (Z€0'0) (900'0) (910'0) (600'0) r vre0 9210 €Sc'0 6810 IvL'0 8010 v8e'0 8910 s_Z "yOOUNISU] o[dures-osioA0y sq [oueg OVZ'LOTT OLV'OSr'E STOSLST OS9°880°E S06 TEET 968IEE'T OOSIEE'T TCOBTEET SUOT}BAIOSGQ €L0°0 690°0 6S0°0 SL0'0 6110 Tz0°0 680°0 TS0°0 SUIOSjNO UBsIAl (TZ0'0) (110°0) (L100) (Z10°0) (810°0) (600'0) (S100) (€10°0) r €€7'0 €se0 0870 9re'0 C870 6rT'0 elv0 0€7'0 'Z "yuoun]ysuy ouTjaseg :Vy jourg _ (P) dd (P)4d SUINIYSIN oumAeq ddI AA-UON OUT AA MO] ust Josunox JOPIO oiduesqns 'p 'pasousriq :oWl09jnO S}say, AJIOTUOJOUOJ[ [PULIOJUT :9°V F1GRL A.55 Table A.7: Judges-Design Estimates of the Effect of Diagnosis on Other Outcomes Stations with Outcome All Stations Balance on Age Admissions within 30 days 1.114 0.633 -0.076 0.587 (0.338) (0.219) ED visits within 30 days 0.146 0.290 -0.385 0.290 (0.121) (0.201) ICU visits within 30 days 0.201 0.044 -0.088 0.042 (0.051) (0.067) Inpatient-days in initial admission 10.695 2.530 0.588 2.209 (2.317) (2.193) Inpatient-days within 30 days 11.383 3.330 -1.123 3.043 (2.059) (1.879) Mortality within 30 days 0.150 0.033 
Note: This table presents results using the assigned radiologist's leave-out diagnosis propensity in Equation (4) as the instrument to calculate the effect of diagnosis on other outcomes, similar to the benchmark outcome of false negative status in Figure VI. All regressions control for 77 variables of patient characteristics, described in Section 3 and Appendix Table A.2, and time dummies interacted with location dummies. Columns 1 and 3 give results of the IV estimates; standard errors are given in parentheses. Columns 2 and 4 report mean outcomes. Columns 1 and 2 show regressions using the full sample of stations; Columns 3 and 4 show regressions using the sample of 44 stations with balance on age, described in Section 4.2.

Table A.8: Alternative Specifications

[Table: for each alternative specification — Baseline; Balanced; VA users; Admission; Minimum Controls; No Controls; and Fix λ, flexible ρ — Panel A reports data and reduced-form moments (the SD of diagnosis, the SD of false negative status, the SD of the false negative residual, and the IV slope) along with the number of observations and radiologists, and Panel B reports the variation decomposition (diagnosis and false negative variation under uniform skill and under uniform preferences).]

Note: This table shows robustness of results under alternative implementations. "Baseline" presents our baseline results. "Balanced" presents results estimated only on the 44 stations with quasi-random assignment we identify. "VA users" restricts to a sample of veterans with more total visits in the VA than in Medicare. "Admission" defines false negatives only in patients with a high probability of admission. "Minimum controls" performs risk-adjustment using only time and stations. "No controls" presents results estimated using the raw diagnosis and miss rates, without adjusting for stations, time, and patient characteristics. "Fix λ, flexible ρ" presents results estimated by fixing λ at the estimated value in the baseline specification but allowing ρ, the correlation between α and β, to vary flexibly. Appendix A provides rationale for each of these implementations and further discussion. Standard errors for Panel B, shown in parentheses, are computed by block bootstrap, with replacement, at the radiologist level.
Table A.9: Alternative Specifications (Additional Detail)

[Table: for each alternative specification in Appendix Table A.8, with κ fixed at 0.336 in all columns, Panel A reports model parameter estimates with standard errors in parentheses, and Panel B reports the mean, 10th percentile, and 90th percentile of radiologist primitives.]

Note: This table shows additional details of the robustness results under alternative specifications. The columns, each corresponding to an alternative specification, are the same as in Appendix Table A.8. The parameters in Panel A are the same as those discussed in Table I.
Table A.10: Model Results Under Alternative Values of κ

Panel A: Value of κ
κ                                      0.168   0.336   0.504
Panel B: Model Parameter Estimates
μ_α                                    1.023   0.945   0.798
σ_α                                    0.291   0.296   0.311
μ_β                                    1.916   1.895   1.863
σ_β                                    0.143   0.136   0.129
λ                                      0.020   0.026   0.035
ν̄                                      1.740   1.635   1.499
Panel C: Variation Decomposition
Diagnosis, Uniform skill               0.627   0.613   0.618
Diagnosis, Uniform preference          0.698   0.709   0.694
False negative, Uniform skill          0.224   0.220   0.216
False negative, Uniform preference     0.965   0.966   0.967

Note: This table presents results analogous to Table I under different values of κ. In the baseline estimation, κ = 0.336 is calibrated as the fraction of patients whose probability of having pneumonia predicted by a machine learning algorithm is smaller than 0.01. We use two other values of κ that represent a 50% decrease (Column 1) and a 50% increase (Column 3) around the calibrated value (Column 2). Panel B shows model parameter estimates corresponding to these alternative thresholds. Panel C shows the variation decomposition under these alternative thresholds. Parameters are described in further detail in Sections 5.1 and 5.2, and the counterfactual variation exercise is described in further detail in Section 6.1.
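The calibration of κ described in the note reduces to a one-line computation once predicted probabilities are in hand; here is a sketch with made-up predicted probabilities standing in for the machine-learning output.

```python
def calibrate_kappa(predicted_probs, cutoff=0.01):
    """Fraction of patients whose predicted pneumonia probability
    falls below the cutoff: the 'not at risk' share kappa."""
    return sum(p < cutoff for p in predicted_probs) / len(predicted_probs)

# Made-up predicted probabilities for illustration only.
probs = [0.001, 0.004, 0.02, 0.15, 0.008, 0.30, 0.002, 0.06, 0.009, 0.05]
print(calibrate_kappa(probs))  # → 0.5
```

Halving or multiplying the resulting κ by 1.5, as in Columns 1 and 3 of the table, probes how sensitive the structural estimates are to this calibration choice.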
Table A.11: Slope Estimates Controlling for Radiologist Skill

             (1)       (2)       (3)       (4)       (5)       (6)
Panel A: True Skill
Diagnosis   0.096    -0.124    -0.132    -0.147    -0.155    -0.156
           (0.016)   (0.014)   (0.019)   (0.019)   (0.017)   (0.017)
Panel B: Skill Posteriors
Diagnosis   0.096    -0.342    -0.575    -0.668    -0.698    -0.752
           (0.016)   (0.084)   (0.084)   (0.119)   (0.143)   (0.237)
Panel C: Indirect Least Squares
Diagnosis   0.096    -0.251    -0.364    -0.369    -0.208    -0.051
           (0.016)   (0.043)   (0.034)   (0.036)   (0.058)   (0.119)

Note: This table presents slope estimates in simulated data of λ*, or the LATE of diagnosis d_i on false negative m_i, based on IV regressions identified by the judges-design relationship between radiologist diagnosis and miss rates. Column 1 in all panels presents the same specification, akin to the benchmark IV regression in the paper, instrumenting d_i with the leave-out diagnosis propensity Z_j in Equation (4), with no further controls. For Panel A, we additionally control for true (simulated) radiologist skill α_j. For Column 2 of this panel, we control for linear α_j; for Columns 3-6, we control for indicators for each of 5, 10, 20, and 50 bins of α_j, respectively. For Panel B, we use the empirical Bayes posteriors instead of true skill, defined in Appendix E.3. For Column 2 of this panel, we linearly control for the posterior mean of α_j; for Columns 3-6, we control for indicators for each of 5, 10, 20, and 50 bins of this posterior mean, respectively. Panel C shows results from indirect least squares, regressing m_i on posteriors of P_j and α_j by OLS. For Column 2 of this panel, we control for the posterior mean of α_j; for Columns 3-6, we control for posterior probabilities that α_j resides in each of 5, 10, 20, and 50 bins, respectively. Standard errors, shown in parentheses, are clustered by radiologist. In Panels B and C, standard errors are computed from 50 samples drawn by block bootstrap with replacement, at the radiologist level. We compute the true estimand λ* = −0.154.
Appendix G.4 provides further details.
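Without controls, a judges-design estimate like Column 1 has a simple Wald form: the ratio of the outcome-instrument covariance to the diagnosis-instrument covariance. The sketch below runs it on a toy simulated data-generating process (not the one in Appendix G.4).

```python
import random

def iv_slope(y, d, z):
    """2SLS slope of y on d, instrumenting d with z (no controls):
    Cov(y, z) / Cov(d, z)."""
    n = len(y)
    my, md, mz = sum(y) / n, sum(d) / n, sum(z) / n
    cov_yz = sum((yi - my) * (zi - mz) for yi, zi in zip(y, z))
    cov_dz = sum((di - md) * (zi - mz) for di, zi in zip(d, z))
    return cov_yz / cov_dz

def simulate(n=5000, seed=0, true_slope=-0.154):
    """Toy DGP: instrument z shifts diagnosis d, which shifts outcome y."""
    rng = random.Random(seed)
    z = [rng.random() for _ in range(n)]
    d = [zi + rng.gauss(0, 0.1) for zi in z]            # first stage
    y = [true_slope * di + rng.gauss(0, 0.05) for di in d]  # outcome equation
    return y, d, z

y, d, z = simulate()
print(round(iv_slope(y, d, z), 3))  # close to the true slope of -0.154
```

When skill varies across radiologists, z is correlated with skill as well as with the diagnostic threshold, so the uncontrolled Wald ratio no longer recovers the true slope; that is exactly the contrast between Column 1 and Columns 2-6 of the table.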