Understanding Data Volume in Reports
At ACRE Consulting, one of the things we do a lot of is writing reports. We see a lot of different request for reports which vary in complexity from extremely simple to crazily complex. The resulting reports end up varying wildly in measures like efficiency and run-time. Some will run quickly, while others might take ages to compile (fill with data) and might even time-out (give up before completion).
We'd like to walk you through what's going on in the background of a report so you can have a better understanding of what might make your report run quickly or slowly.
Number of Records
The first and biggest factor is, of course, the number of records that your report will be pulling. You can think of a record like a row in an Excel sheet. An Excel sheet with a million rows is going to take a lot longer to load than one that just has ten rows, and the same is true for reports.
Let's imagine that you have a moderately small database, with about 1,000 clients in it. If you run a report that just asks for the names of every client, then that's going to give you 1,000 records - one for each client. Now generally speaking, a report that is pulling 1,000 records is going to run very quickly. Your server speed will vary but let's say, for the sake of argument, it takes 0.0005 seconds to pull each record. 1,000 * 0.0005 seconds is a total run-time of 0.5 seconds to pull the data. A thousand records might sound like a lot, but computers are good at computing, so a thousand records is not particularly challenging for an average server to handle.
You're rarely just going to be getting a list of all your clients' names, however. More commonly, you might want a report that looks at bed-nights, or clients served in a particular date range or something like that. So we're not looking at 1,000 records anymore. Instead, let's imagine that of those 1,000 clients, there were a total of 3,000 stays in shelter in the past year. If you want to count bed nights, it's going to include all of those 3,000 shelter stay records, which means the report is going to take three times as long to run (there's three times as much data).
Importantly, this is the case even if you don't see all 3,000 records! If you have a report that just spits out year-end totals, like "the total number of bed-nights this year was 20,000," the report still needed to look at all 3,000 records and count the nights for each of them.
If you have a particularly large database - and we know of at least one community in Canada with close to 100,000 clients in it and close to 1,000,000 stay records - you're probably going to want to narrow down the number of records you're looking at. You can do this by using parameters such as specifying a date range and specifying which service providers you want to look at. You may have 1,000,000 stay records in your database, but if you only want to look at stays at one shelter in the past month, that's going to be a much smaller number of records, and your report will run much more quickly.
Formulas and Calculations
Many reports include calculations of some sort. In the example above using bed-nights, calculating the number of bed-nights is itself a calculation. The data that's stored in the database is just the book-in date and the book-out date, so in order to calculate how many nights that was, we need to do some basic math. It's not a particularly complex formula - we're just looking at the difference between the start date and the end date - but the computer takes time to make that calculation. Even if it takes 0.0005 seconds to perform the calculation, the computer needs to perform that calculation once for every record. With 1,000,000 records, it would take more than 8 minutes to perform the calculations (once for each record), and that's on top of the time that it would take to retrieve the records.
So as you can see, these start to add up. Let's say that in addition to the bed-nights, you want your report to tell you the age group of each client. There is a data field that contains the client's age; now we need a calculation that's going to put it into a category of child, youth, adult, or senior. Now, for each record, we need to calculate both the client's age category and the bed nights for the stay.
Let's say that it takes 0.0005 seconds to retrieve each record, and then 0.0005 seconds to perform a calculation on it. There's two calculations, so it's now a total of 0.0015 seconds to handle each record. Hypothetically, our report that just pulled the 3,000 stays would take 1.5 seconds to run, but if we add in two formulas that will be calculated for each record, the report now takes 4.5 seconds to run.
Multiple Data Sources
One of the things that significantly increases run-time is when you'd like to include data from different sources. For example, we want to include our shelter stays, but we also want to see the most recent assessment score for our clients and their sources of income.
Let's imagine, just for a second, that of our 1,000 clients, on average, each of them has 2 different sources of income. Some might have none and some might have 5 or 6, so on average we're going to go with 2. We're now adding records in. Each client has 2 extra, additional records that come along with them, so we have 3 records we're including for each client, and then also each client might have multiple stays. If we also include their most recent assessment score, we're adding one more additional record. So each client file has, effectively, 4 records (on average). Instead of being 1 piece of data, it's 4, so we can multiply everything by 4. Our 3,000 stays becomes 12,000 records.
Confused? Let's use an example. We're talking about a report that's going to look like this:
Sources of Income
Most Recent Assessment
|VI-SPDAT Score: 7
Our client Peter has two sources of income. They're "Provincial Social Assistance" and "Money from Friends/Family." Peter also has a VI-SPDAT record. For each stay, our report is going to need to do the following:
- Retrieve the stay record
- Calculate the number of bed-nights
- Retrieve the client record
- Calculate the client's age group
- Retrieve the client's first source of income
- Retrieve the client's second source of income
- Retrieve the client's most recent assessment
Because we've added in additional data sources in addition to our calculations, we're increasing the number of steps it takes for the computer to be able to deal with each record. Now there are 7 steps per record, and 3,000 records, and 0.0005 seconds per step, for a total of 10.5 seconds to compile the report.
Ten seconds to run a report is not that long, but it is starting to get noticeable. You'll see a loading screen and the loading circle will spin a few times before you see your report generated. And this report is not even particularly complicated, nor is it including a large amount of data. Larger communities could easily have 3,000 stay records per month rather than per year.
Special Discussion: Chronic Homelessness
It's worth noting here how chronic homelessness comes into play, since it's a data point of interest for many. At the time of writing, there's no easy way to just check a single field and find out if the client is homeless or not. The HIFIS 4 software does not calculate automatically whether a client is chronically homeless; a report would have to do that as its own calculations. On the surface, this seems like just one more calculation that would add a little to the report load time but not an unreasonable amount. However, let's dive a little deeper, shall we?
The typical definition of chronic homelessness is experiencing 6+ months of homelessness in the past 1 year, or 18+ months of homelessness in the past 3 years. In order to calculate that, we would look back at the client's history for the past one and three years and check to see if they had accumulated enough experiences of homelessness to be counted as chronic. That means we would ask the computer to look up all the client's previous shelter stays, and their housing history, going back several years. The computer pulls up a stay record then checks whether it was within the past three years then if so calculates the number of days for that stay then adds it to a running total and moves on to the next record. Some individual clients could have hundreds of records attached to their file within that range! So to calculate whether a client is chronically homeless once might mean adding a hundred new calculations! (Strangely, the report would calculate faster if your shelters have fewer, longer stays, as opposed to booking people in and out every day... but that's just a side note.)
And then, on top of that existing demand, chronic homelessness is a dynamic state that has to do with our perspective - when are we looking at the data? You could serve a client in January, and the client is not chronically homeless, and then serve the client again in April, and at that point they are chronically homeless. So when writing a report, when we include a field that would indicate whether a client was chronically homeless or not, we would ask chronically homeless as of when? If you were running the report in April you'd probably want the client to be listed as chronically homeless, and if you were running the report in January you'd probably want them to be listed as not chronic. Usually, you'd want to know if they're chronically homeless as of the report range.
But what if you were running a report for the whole year? Of all the clients we served last year, how many were chronically homeless? Well, it would make sense for our aforementioned client to be considered both not chronic (in January) and chronic (in April), which opens a whole new can of worms. The most accurate thing we could do would be to check every single day whether the client was chronically homeless on that day, which allows for a client to "age in" to chronicity. But that means you'd be running the previous hundred new calculations per client... three hundred and sixty five times!
Let's say, conservatively, that every client has, on average, only 5 (not 100) previous records of homelessness. If you wanted to check every single day to find out whether your clients were chronically homeless, and you hypothetically had 1000 clients, you'd need to run 1,000 clients * 5 records * 365 days = 1.8 million calculations. At 0.0005 seconds per calculation, that's 15 minutes to compile the report!
Please bear in mind that our estimated execution times for all of the calculations above were purely hypothetical, but we wanted to demonstrate how the various components in a report start to add up. This is by no means an exhaustive list, either. Adding in charts or graphs are also going to contribute to slowing down your reports, and there's plenty of other things that we didn't even mention that will have an impact.
Does this all mean that you shouldn't have complex reports? Good heavens, that's not what we're saying at all. Absolutely you should be extracting the data that you need. But you'll be better off if you have a good understanding of how complex your report is, and the things you can do to improve your report's efficiency.
The biggest piece of advice we can give is that a bunch of smaller reports is going to work a lot better than one giant mega-report. That's why, for example, we broke up the Community Homelessness Prevention Initiative reporting requirements into six different pieces!
Another thing you can do is look at the data that you want to include on a particular report and ask if it's necessary to be included on that particular report. Hypothetically, you might have a report that looks at current caseload. It may be tempting to include additional information about each client such as their address, their assessment scores, their sources of income, their contributing factors, and so on, but if you want a report that's going to run efficiently, ask yourself what information needs to be on this specific report?
It's perfectly reasonable to have two similar reports that are for different audiences. For example, instead of one master caseload report, you might have two different ones: one that includes practical information like case notes, upcoming appointments, and target move-in dates, to be used by the caseworker; and a second that has high-level information such as expended time and assessment score changes over time, to be used by supervisors.
In terms of chronic homelessness, ask yourself what the data is going to be used for. If you only need to know who is chronically homeless as of today, in order to operationalize your By-Name List, that's great. You should have a report that does that, and very little else. If you need to know who became chronically homelessness last July, we can make a report that does just that. But don't try to have a giant master report that does everything all at once.
Try to be clear about the purpose of a report, while you're thinking about designing one, and if you're concerned that your report might be too slow or too cumbersome, try streamlining it by removing the data points that you don't need!