Building the Layer Cake: A quick-value approach to AI and machine learning

For every truck in their fleet, UPS currently recalculates the best route after every single package is delivered. So, if you and your neighbor are both getting a package, in the time it takes to deliver them, UPS has determined the optimal route for all remaining packages on the truck … twice! This is a machine learning feat that required satellite launches, GPS and map experts, network theory, theories of optimization from mathematics, ten years, and hundreds of millions of dollars (see this Harvard Business School article). However, UPS didn’t start with this complex solution when they decided to optimize routes.

I heard the following story a few weeks ago at the (virtual) Data Science Connect conference. It contains a powerful lesson for thinking about complex projects.

As it turns out, the first thing UPS did was simply make sure that all the packages on a given, single truck were headed to the same neighborhood (or delivery area). They made a simple, common-sense change that immediately delivered value. Additionally, this change set them up for the eventual big-budget, time-consuming project.

If the packages on a single truck need to go to many different parts of a city, there is a limit to how much route optimization can improve things. It would be like asking, “what’s the best way to pay my bills by driving one penny at a time to each business I owe?” We can figure out an optimal route, but maybe there are some more basic problems to solve here first.

There is a general principle for AI or Big Data projects that we can extract from this story, and to illustrate it, let’s think about cake. I’ve taken this metaphor from a wonderful talk by Bill Franks, Chief Analytics Officer of the International Institute for Analytics (Franks 2020; full reference below).

Layer Cake. Image used with permission under a Creative Commons CC0 1.0 Universal Public Domain Dedication.

Consider a layer cake. When you decide to make one, you don’t go straight from having no cake to having four layers of cake (Figure above). Rather, you put down the first layer and then proceed to build on top of it with each layer setting up the conditions for the next. First, UPS had to group boxes bound for the same delivery area onto the same truck. Then, they had to figure out how to optimize the truck’s route once. Finally, they figured out how to optimize routes in real time.

Notice that each “layer” of the UPS optimization cake needed the layer before it. Grouping packages onto trucks by delivery area set up the success of route optimization. Likewise, it doesn’t make sense to try optimizing a route hundreds of times a day before you try optimizing it once. So, the single-optimization layer sets up the success of the real-time optimization layer.

Additionally, the construction of each layer provided immediate value. Grouping packages was better than not. Optimizing the routes once was better than no optimization. And finally, real-time adjustment is better than single optimization at the start of the day.

This story has informed my thinking. Now when researchers ask me about machine-learning tasks (e.g., predicting X a certain number of minutes before it happens), I think about what the simpler base layers of the cake might look like. I think about what deeper, basic questions we can answer that will provide quick value while setting us up for success on the big (whole layer cake) question.


Franks, Bill. (2020, October 7-9). Scaling Data Science & Analytics [Conference presentation]. DSC 2020 Conference. Virtual.

When has a significant change occurred?

Our department (Anesthesiology and Perioperative Medicine) is constantly engaged in research and quality improvement (QI). Both pillars of the department often involve measuring the change caused by an intervention. For a large swath of such projects, a statistical test (e.g., t-test or chi-squared test) can indicate if there’s a difference between two groups of data. However, if that data has a time component, things can get more complicated.

Through this post, I hope to provide exposure to a segment of statistics called “Interrupted Time Series,” providing enough information for you to have an idea of when it might be an appropriate analysis method for your own work. I’ll avoid the mathematics and deeper details but provide enough details for you to recognize the utility and know what to search for if you’re interested in learning more.

A branch of time series analysis called “interrupted time series” (or “ITS”) helps with quantifying the effects of an intervention. Interrupted time series applies when there is a set of time-bound data before an intervention, a clear time period when an intervention occurs, and a set of time-bound data after the intervention. The core ideas of ITS come from middle school math: slopes and intercepts. ITS is a rigorous way of asking, “Did the slope change after the intervention?”, “Did the intercept change after the intervention?”, and “Are these changes statistically significant?” Some fields file this method under “quasi-experimental study design” or “difference-in-differences,” and the particular charts I show below are all examples of segmented regression, a method within ITS.
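For readers who want something to search for, here is one common way to write the segmented regression model behind those three questions. Treat it as a sketch: the symbols are my own labels for illustration, and real analyses often add terms (for seasonality or autocorrelation, for instance).

```latex
Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - t_0) D_t + \varepsilon_t,
\qquad
D_t =
\begin{cases}
0 & \text{if } t < t_0 \\
1 & \text{if } t \ge t_0
\end{cases}
```

Here $t_0$ is the time of the intervention, $\beta_1$ is the slope before the intervention, $\beta_2$ is the change in level (intercept) at the intervention, and $\beta_3$ is the change in slope afterward. Asking whether the intervention mattered amounts to asking whether $\beta_2$ or $\beta_3$ differs significantly from zero.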

For example, suppose some policy change is intended to reduce the rate of naloxone administrations per month. Before the intervention, we might have the number of patients per thousand administered naloxone each month. Our time period of intervention would be when the policy goes into effect. After that date, we would continue to record administrations per thousand each month. To understand why such an experiment requires a special kind of analysis, let’s consider a few potential results in the figure below, where the black vertical line represents the time when the policy went into effect. Note that these plots are completely fake and are (hopefully obviously) exaggerated for effect.

Figure 1 of Lagarde, Mylene. “How to do (or not to do)… Assessing the impact of a policy change with routine longitudinal data.” Health policy and planning 27.1 (2012): 76-83. These are exaggerated examples meant to indicate the need for statistical analysis in time-bound studies of policy impact.

Notice that with panel A, a typical two-sample statistical test might tell you that the averages before and after the intervention are different. Even worse, you might conclude your intervention had the opposite of the intended effect, since the average after is larger than before. However, ITS would tell you that your intervention had no effect, since neither the slope nor the intercept of that plot changes when the policy goes into effect. On the other hand, with panel B, a two-sample statistical test might show no change with the intervention, but the slope of the line has changed, which ITS analysis would detect. In panel D, a statistical test might tell you that your intervention failed because the average increases, but ITS would also tell you that your slope is now negative, meaning that in the long run the intervention is having the intended effect.
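To make the panel A situation concrete, here is a minimal sketch in plain Python. The data are made up (a steady upward trend with nothing happening at the policy date), and `fit_line` and the month-12 cut point are hypothetical names and values I chose for illustration. It only compares averages and fitted lines; it does not run a significance test.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one segment."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

# Made-up monthly rates on a steady upward trend; nothing changes at month 12.
months = list(range(24))
rates = [10 + 0.5 * m for m in months]
cut = 12  # hypothetical month the policy takes effect

pre_x, pre_y = months[:cut], rates[:cut]
post_x, post_y = months[cut:], rates[cut:]

mean_before, mean_after = mean(pre_y), mean(post_y)
slope_before, intercept_before = fit_line(pre_x, pre_y)
slope_after, intercept_after = fit_line(post_x, post_y)

print(f"mean before: {mean_before:.2f}, mean after: {mean_after:.2f}")
print(f"slope before: {slope_before:.2f}, slope after: {slope_after:.2f}")
print(f"intercept before: {intercept_before:.2f}, intercept after: {intercept_after:.2f}")
```

The averages differ by a wide margin, yet the slope and intercept are identical on both sides of the cut. A comparison of means sees a "change" that is really just the pre-existing trend.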

I skipped panel C because it demonstrates the need for one of the more advanced techniques available for ITS. There appears to be seasonality in panel C. Luckily, as a time-series method, ITS can explicitly address seasonality. Unfortunately, such examples do require quantitative statistical analysis and are often impossible to judge by eye.

The final advantage of ITS I want to emphasize is the visuals that come out of it. The figure below has the key components of an ITS chart. The actual data is plotted in faint red. There are separate fitted lines (which give the slope and intercept) before and after the intervention in solid red. The time of intervention is marked with a dotted vertical black line. And finally, in dotted red are the counterfactuals. These are the predictions of what would have happened without the intervention. The process of fitting these lines in statistical software (such as R or SAS) provides analysis of statistical significance for free. Essentially, you get a plot that tells your story intuitively and, as a byproduct, indicates whether the changes you see are statistically significant.
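The pieces of that chart can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the data are invented, there is no noise, and `fit_line`, `cut`, and the other names are hypothetical. It fits a line on each side of the intervention and extends the "before" line forward as the counterfactual. It does not compute statistical significance, which is where software like R or SAS earns its keep.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one segment."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

months = list(range(24))
cut = 12  # hypothetical month the policy takes effect
# Made-up rates: rising 0.5 per month, then a drop of 3 and a decline of 0.25 per month.
rates = [10 + 0.5 * m if m < cut else 13 - 0.25 * (m - cut) for m in months]

pre_slope, pre_intercept = fit_line(months[:cut], rates[:cut])
post_slope, post_intercept = fit_line(months[cut:], rates[cut:])

# Counterfactual: extend the pre-intervention line past the intervention date.
counterfactual = [pre_intercept + pre_slope * m for m in months[cut:]]
fitted_after = [post_intercept + post_slope * m for m in months[cut:]]

level_change = fitted_after[0] - counterfactual[0]  # jump at the intervention
slope_change = post_slope - pre_slope               # change in trend
gap_at_end = fitted_after[-1] - counterfactual[-1]  # fitted minus counterfactual, last month

print(f"level change: {level_change:.2f}")
print(f"slope change: {slope_change:.2f}")
print(f"gap vs. counterfactual in final month: {gap_at_end:.2f}")
```

The gap between the fitted line and the counterfactual is the quantity the dotted line in an ITS chart lets you read off by eye.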

Visualizing the results of ITS and segmented regression.

In this brief overview, I’ve avoided the mathematics and deeper details. I hope that this discussion provides insight into when ITS might be an appropriate analysis method. Performing segmented regression does require specialized software and specialized, technical knowledge. If you have a project where ITS seems appropriate, I would suggest reaching out to a Data Scientist for a consultation on what the best option is for your particular research project.

If you want to know more, Anesthesia and Analgesia has a thorough review of the method. For technical details, I would recommend this excellent edX course.

What is a random forest?

Random forests have left the… well, forests of machine learning esoterica and become an analysis technique in nearly every scientific field. In perioperative medicine, this technique has been used to predict patient mortality more accurately than ASA score (Hill and colleagues 2019), predict blood transfusion needs (Jalali and colleagues 2020), detect hemorrhage (Pinsky and colleagues 2020), and predict prolonged length of stay (Gabriel and colleagues 2019). But what is a random forest?

Decision tree

To understand the forest, we must first understand the trees. At least, that’s what my meditation app told me this morning. It was a prescient comment because random forests are indeed made up of trees.

Imagine that I asked you if a patient about to undergo surgery is at higher risk of mortality. You might ask me questions like, “What is the patient’s ASA score?” Suppose I say, “2.” Then you might ask about the surgery they’re about to have. Or you might ask me about their age. With each answer, you’re improving your prediction about mortality. And each answer might inform the next question you ask. If I said ASA 3, your second question might be different than if I say ASA 1. Mentally, you are constructing a decision tree to answer my question.
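In code, that chain of questions is just nested conditionals. The sketch below is a toy: the function name, the questions, and every threshold are invented purely for illustration and carry no clinical meaning.

```python
def toy_mortality_tree(asa_score, age, emergency_surgery):
    """A hand-written decision tree; each `if` is one question.

    All thresholds are made up for illustration only.
    """
    if asa_score >= 3:
        # Higher ASA score: the next question is about the surgery.
        return "higher risk" if emergency_surgery else "not higher risk"
    # Lower ASA score: the next question is about age instead.
    return "higher risk" if age >= 80 else "not higher risk"

print(toy_mortality_tree(asa_score=3, age=40, emergency_surgery=True))
print(toy_mortality_tree(asa_score=1, age=85, emergency_surgery=False))
```

Notice that the second question depends on the answer to the first, exactly as in the conversation above. A trained decision tree learns its questions and thresholds from data rather than having them typed in by hand.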

As a visual example, consider if you are trying to decide what kind of room a patient should be placed in based on information from an electrocardiogram. You might use a decision tree like the following.

Decision tree example by SilviaCalvanelli. Used with permission under the creative commons CC-BY-SA-4.0 license. This is an illustrative example only. No actionable information is communicated in this figure. It should not be used by anyone for anything.

Now that you have a conceptual understanding of a decision tree, we can return to the idea of random forests.

Random forest

Think about the mortality risk question again. Instead of just asking one clinician, what if I got twenty together in a room and asked them all this question? Obviously, things might get messy if everyone starts asking me about the patient at once. So, in this imaginary scenario, I say that everyone gets to ask one question, and each person must tell me their prediction as soon as their question has been answered. I treat each person’s prediction like a vote, and once all the votes are in, I announce the winning prediction (high risk or not).

With these 20 clinicians each asking one question and getting one vote, I’ve constructed a random forest with 20 trees, each having a maximum depth of 1. If each clinician got to ask 2 questions, then the maximum depth would be 2, and so on. If I had 100 clinicians, then the forest would have 100 trees.

The takeaway from this example is that a random forest is a whole bunch (potentially thousands) of very simple models that are each allowed to ask only a few questions before making a prediction. The votes from all the simple models are then tallied, and the winning prediction is announced. The technique sometimes seems comically basic, but random forests have proved to be powerful prediction tools.
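Here is the room-of-clinicians scenario as a minimal Python sketch. Everything in it is hypothetical: the question names, the patient, and the helper functions are mine, and no learning from data happens. Each "tree" simply picks one question at random and votes on the answer.

```python
import random
from collections import Counter

# Hypothetical yes/no questions a depth-1 tree (one "clinician") may ask.
QUESTIONS = ["asa_3_or_higher", "age_over_75", "emergency_surgery", "prior_heart_failure"]

def make_stump(rng):
    """A tree of depth 1: ask one randomly chosen question, then vote."""
    question = rng.choice(QUESTIONS)

    def stump(patient):
        return "high risk" if patient[question] else "not high risk"

    return stump

def forest_predict(forest, patient):
    """Tally one vote per tree and announce the most common answer.

    On an exact tie, Counter.most_common returns whichever answer was seen first.
    """
    votes = Counter(tree(patient) for tree in forest)
    return votes.most_common(1)[0][0], votes

rng = random.Random(0)  # fixed seed so the toy forest is reproducible
forest = [make_stump(rng) for _ in range(20)]  # 20 trees, each of depth 1

patient = {
    "asa_3_or_higher": True,
    "age_over_75": False,
    "emergency_surgery": True,
    "prior_heart_failure": False,
}
prediction, votes = forest_predict(forest, patient)
print(prediction, dict(votes))
```

Changing `range(20)` to `range(100)` gives the 100-clinician forest. Letting each stump ask a second question based on the first answer would raise the maximum depth to 2.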


The example above certainly isn’t perfect. In a more precise example, the trees wouldn’t directly know each other’s questions, but they would be told whether their answer was improving the prediction based on cases where we already know the right answer, and they would be allowed to change their questions and votes accordingly. This process is called “model training.” Also, each tree wouldn’t start from a place of expert knowledge like clinicians would. The trees would be more like randomly selected non-medical folks who were given a list of potential questions to choose from. Finally, the final prediction doesn’t have to be a majority vote. As an alternative, each participant could say how confident they are in their answer, and then the average confidence could be used to pick the winning answer.
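The difference between counting votes and averaging confidence is easy to see with a few made-up numbers. In this sketch, the confidences and the 0.5 threshold are assumptions I picked so that the two approaches disagree.

```python
from statistics import mean

# Hypothetical confidences (0 to 1) that the patient is high risk, one per tree.
confidences = [0.55, 0.60, 0.52, 0.05, 0.10]

# Majority vote: each tree votes "high risk" if its own confidence is above 0.5.
votes_for_high = sum(c > 0.5 for c in confidences)
majority_says_high = votes_for_high > len(confidences) / 2

# Averaged confidence: pool the confidences first, then apply the same threshold.
average_confidence = mean(confidences)
average_says_high = average_confidence > 0.5

print(f"majority vote says high risk: {majority_says_high}")
print(f"average confidence {average_confidence:.3f} says high risk: {average_says_high}")
```

Three lukewarm "yes" votes win the majority count, but two very confident "no" votes pull the average below the threshold. Which behavior you want depends on the problem.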


Hill, B. L., Brown, R., Gabel, E., Rakocz, N., Lee, C., Cannesson, M., Baldi, P., Olde Loohuis, L., Johnson, R., Jew, B., Maoz, U., Mahajan, A., Sankararaman, S., Hofer, I., & Halperin, E. (2019). An automated machine learning-based model predicts postoperative mortality using readily-extractable preoperative electronic health record data. British Journal of Anaesthesia, 123(6), 877–886.

Gabriel, R. A., Sharma, B. S., Doan, C. N., Jiang, X., Schmidt, U. H., & Vaida, F. (2019). A Predictive Model for Determining Patients Not Requiring Prolonged Hospital Length of Stay after Elective Primary Total Hip Arthroplasty. Anesthesia and Analgesia, 129(1), 43–50.

Pinsky, M. R., Wertz, A., Clermont, G., & Dubrawski, A. (2020). Parsimony of hemodynamic monitoring data sufficient for the detection of hemorrhage. Anesthesia and Analgesia, 130(5), 1176–1187.

Jalali, A., Lonsdale, H., Zamora, L. V., Ahumada, L., Nguyen, A. T. H., Rehman, M., Fackler, J., Stricker, P. A., & Fernandez, A. M. (2020). Machine Learning Applied to Registry Data. Anesthesia & Analgesia, Publish Ahead of Print(Xxx), 1–12.

More p-values, more problems

I’m a fan of the web comic XKCD by Randall Munroe, which takes esoteric math and science concepts and turns them into jokes. In one edition, Munroe tackles the issue of multiple hypothesis testing: If you test many hypotheses simultaneously without adjusting your significance cutoff (e.g., p<0.05), false positives are going to happen more than you might expect.

In the related edition of XKCD, two characters want to know if jelly beans cause acne. Scientists investigate this claim and find no link between jelly beans and acne. That is, the scientists test the null hypothesis, “There is no relationship between jelly bean consumption and acne,” and fail to reject it. The results will not surprise you.

XKCD “Significant” panels. Used under creative commons license CC BY-NC 2.5.

In this joke example, the scientists test one hypothesis, calculating one p-value and comparing that one p-value to a critical value (here 0.05). No problem so far. But, what if I am concerned that one specific color out of a possible, say, 20 jelly bean colors causes acne?

XKCD “Significant” panels. Used under creative commons license CC BY-NC 2.5.

So green jelly beans cause acne? I can see the headlines now.

XKCD “Significant” panels. Used under creative commons license CC BY-NC 2.5.

Notice that the joke front-page news includes the comment “only 5% chance of coincidence.” Is that right? If the scientists had tested a single hypothesis, then yes. However, that’s not what happened. The scientists tested 20 hypotheses. So what are the odds this result happened by chance?

Remember that a p-value tells you the likelihood of getting a result as extreme or more extreme by chance. That means for a single hypothesis test, the p-value tells you the likelihood of getting something like your result by chance. What about if I tested 20 independent hypotheses with a cutoff of 0.05? In that case,

$$P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} \approx 0.64$$

There is a 64% chance of at least one false positive. Said another way, it is more likely than not that this experiment will yield at least one false positive just by chance.
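You can check that figure, and see how quickly it grows, with a few lines of Python. This is a sketch that assumes the tests are independent and that every null hypothesis is actually true. The function name is my own.

```python
def chance_of_any_false_positive(alpha, n_tests):
    """Probability that at least one of n independent true-null tests falls below alpha."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20, 100):
    chance = chance_of_any_false_positive(0.05, n_tests)
    print(f"{n_tests:>3} tests: {chance:.1%} chance of at least one false positive")
```

With one test the chance is the familiar 5%. With 20 tests it is about 64%, and with 100 tests a false positive is nearly guaranteed.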

What do we do?

The simplest approach is to divide your cutoff value by the number of simultaneous hypotheses (Miller 1981). This process is called a Bonferroni correction. In this case, that would be

$$\frac{0.05}{20} = 0.0025$$

You may think, “That’s a very strict cutoff.” You’re right. This cutoff will do a great job of preventing false positives. In fact, we can prove it.

$$1 - (1 - 0.0025)^{20} \approx 0.0488$$

This number, 0.0488, can be thought of as the cutoff equivalent. If we were to somehow condense all 20 tests into 1, the cutoff for this test would be 0.0488. However, as you might expect, this process results in more false negatives than would be expected from a single hypothesis test. In fact, you can prove that the false negative rate tends toward 1 as the number of tests increases (Efron 2004). That is, if you do a lot of simultaneous tests with this method, you’ll fail to reject the null hypothesis nearly every time, regardless of whether there is actually a relationship in your data.
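Applying the correction takes very little code. In the sketch below, the p-values are made up (imagine they are a handful of the 20 jelly bean tests, with the rest comfortably non-significant), and the variable names are mine.

```python
alpha = 0.05
n_tests = 20

# Made-up p-values for a few of the 20 colors; the others are assumed non-significant.
p_values = {"green": 0.04, "purple": 0.31, "red": 0.002, "teal": 0.62}

# Bonferroni: divide the cutoff by the number of simultaneous tests.
cutoff = alpha / n_tests
significant = [color for color, p in p_values.items() if p < cutoff]

# Chance of at least one false positive across all 20 tests at the stricter cutoff.
overall_rate = 1 - (1 - cutoff) ** n_tests

print(f"per-test cutoff: {cutoff}")
print(f"significant after correction: {significant}")
print(f"overall false positive rate: {overall_rate:.4f}")
```

A result like green’s p = 0.04 would have made headlines at the 0.05 cutoff, but it does not survive the corrected cutoff of 0.0025.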

The Bonferroni correction is still useful, though. If having even one false positive would mean disaster for your work, then the Bonferroni correction may be the way to go, as it is quite conservative. Likewise, if you are testing only a small number of hypotheses (say <25) and you expect (based on prior knowledge) that only one or two are true, the Bonferroni correction may be the way to go.

There are other options. I’ve linked to resources on a few below, but I would suggest reaching out to a Data Scientist for a consultation on what the best option is for your particular research project.

One option is to control the False Discovery Rate (FDR). For a practical guide to this process, see the section “Controlling the false discovery rate: Benjamini–Hochberg procedure” on this external blog post.  For a theoretical description, see Benjamini and Hochberg 1995.
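For a feel of what the Benjamini–Hochberg procedure does, here is a minimal plain-Python sketch of the step-up rule as I understand it: sort the p-values, find the largest rank k whose p-value is at or below (k / m) times the target FDR, and reject everything up to that rank. The function name and example p-values are mine, and for real work I would lean on a vetted statistics package rather than this.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first

    # Find the largest rank k whose p-value is at or below (k / m) * fdr.
    largest_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            largest_k = rank

    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for i in order[:largest_k]:
        rejected[i] = True
    return rejected

# Made-up p-values from ten simultaneous tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))
```

With these ten p-values, a Bonferroni cutoff of 0.05 / 10 = 0.005 would keep only the first result, while this procedure keeps the first two. That is the trade: a controlled proportion of false discoveries in exchange for fewer false negatives.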

Another option that has shown promise in Anesthesia and Analgesia literature (double meaning intended) is to re-frame your hypotheses to do joint hypothesis testing or a “gate keeping” procedure where you test a second hypothesis under the condition that the first is true (Mascha and Turan 2012).

You can also calculate Benjamini–Hochberg adjusted p-values, which some statistical software calls q-values. However, these adjusted p-values lack the theoretical backing of typical p-values and methods for controlling FDR. They can be useful for quick-and-dirty analysis, but they should be avoided for publication purposes. A careful reviewer may object that such q-values lack statistical rigor. If you want to learn more about these adjusted p-values, these Berkeley lecture notes provide a succinct guide.
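If you are curious what the adjustment actually computes, here is a plain-Python sketch of the usual formula as I understand it (I believe it matches what R’s `p.adjust` does with `method = "BH"`): scale each sorted p-value by m / rank, then walk from the largest p-value down so the adjusted values never increase as the raw p-values get smaller. The function name is mine.

```python
def bh_adjusted(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running minimum
    # of p * m / rank so the adjusted values stay in the same order as the raw ones.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(bh_adjusted([0.01, 0.5]))
print(bh_adjusted([0.04, 0.02, 0.03]))
```

Comparing an adjusted value to your target FDR gives the same accept-or-reject decisions as the step-up procedure, which is why these numbers are tempting for a quick look.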


Munroe, Randall. “Significant.” XKCD.

Miller, R. (1981), Simultaneous Statistical Inference (2nd ed.), New York:

Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association, 99(465), 96–104.

Mascha, E. J., & Turan, A. (2012). Joint hypothesis testing and gatekeeping procedures for studies with multiple endpoints. Anesthesia and Analgesia, 114(6), 1304–1317.

Benjamini, Yoav, and Yosef Hochberg. “Controlling the false discovery rate: a practical and powerful approach to multiple testing.” Journal of the Royal statistical society: series B (Methodological) 57.1 (1995): 289-300.

A risk scorecard for COVID-19

A product of the UAB COVID-19 Data Science Hackathon

On June 15 and 16, I participated in the UAB COVID-19 Data Science Hackathon. My teammates and I created a scorecard model and web app that predicts the likelihood of an individual being infected with COVID-19 based on demographics, pre-existing conditions, and symptoms. We were one of three teams to win a cash prize in the competition for the novelty and success of our work.

My teammates were Thi K. Tran-Nguyen, Tarun Karthik Kumar Mamidi, and Liz Worthey (who proposed the question we focused on). Together, our skills ran the gamut of data wrangling, model development, and medical relevance. As a result, we developed a proof-of-concept model and web app that we envision could be used by a broad population including both patients and providers. The end result is like looking at your credit score, except that it tells you if you are at higher risk of having COVID-19.

Scorecard web app demo prepared by Tarun Mamidi.

This work may be of particular interest to perioperative medicine for both specific patient screening for COVID-19 and risk stratification methods in general.

The hackathon showcase presentations were recorded on Friday, June 19, 2020. To view the presentation for this visit, click here.


I am Dr. Ryan Melvin, Ph.D., the department’s data scientist. My background is in physics and statistics, but my passion is figuring out what the heck data is trying to tell us.

Before UAB, I did predictive analytics (telling the future with math) for one of the top 10 US banks. They’re still in business, so I couldn’t have been that bad at it, right?

This departmental data science blog has two purposes. First, I want to provide examples of what statistical learning, machine learning, and data science in general can do for you. Through these examples, I also hope to peel back the curtain a little bit and demystify the black box of machine learning. Second, I want this blog to be an educational resource for statistical tools and appropriate use of machine learning. This second kind of post will range from pitfalls of those statistical tests you learned back in undergrad all the way to how much you should trust a cancer-spotting AI.

You may already have examples in mind of questions that traditional (sometimes called parametric) statistics can answer, such as “Do these data sets have different means?” or “Which of the three drugs had the greatest impact?”

The questions that machine learning algorithms do particularly well at answering look more like “I wish I could know if a patient is at higher risk for sepsis just one day in advance.” or “I wish I could know if ventilator asynchrony has occurred just 10 seconds after it happens.” You’ll have to forgive my lack of medical knowledge in those examples, but hopefully you get some idea of the form of the questions. There’s a piece of knowledge or prediction and a time frame within which you need that knowledge or prediction. If you have a question like this or even a question that might be better suited for traditional statistics, reach out to me, and we’ll get started!

If you have some extra time on your hands and want to know more about how machine learning can impact medicine and medical research, check out my talk on the subject here.

If you have quite a bit of extra time on your hands, allow me to make two book recommendations: Deep Medicine by Eric Topol and You Look like a Thing and I Love You by Janelle Shane. The title for that second book was generated by an AI, but the book itself was not. At least, I’m pretty sure it wasn’t.