QUESTION
How are datasets prioritized and publicized, and why are some removed?
0:29:48
·
5 min
Zachary Feder explains the process and rationale for dataset prioritization, publication, and removal in NYC's open data portal.
- Datasets are prioritized based on public demand and agency requests, with changes communicated transparently.
- A "dataset of datasets" exists to inform the public about datasets planned for release and those removed, including reasons for removal.
- Reasons for dataset removal include reducing redundancy, merging related datasets, and discontinuing programs to which datasets pertain.
- Historical datasets are preserved with metadata indicating their status, contributing to clear communication about active versus historical data.
Jennifer Gutiérrez
0:29:48
Can you share a little bit about who is in charge of this process and how you determine your prioritization, and does the public have access to how datasets are prioritized?
Martha Norrick
0:30:01
Can I answer those?
Zachary Feder
0:30:02
Sure.
0:30:03
So the first way the public has access to how datasets are prioritized is we have actually a dataset that tells everyone what is going to get released.
0:30:13
Yeah.
0:30:13
It's datasets all the way down.
0:30:14
Yeah.
0:30:16
And if anything changes, so let's say there's a shift where more work is needed than was expected,
0:30:24
that rationale gets shared alongside the date when the change happened and what the new date is.
0:30:32
So that is all transparent for anyone to see.
0:30:35
As far as the prioritization, there are a couple of factors at play.
0:30:40
One is based on demand.
0:30:43
So one of the things that, I think, is enshrined in the law, that we have in our technical standards manual, and that we very much encourage agencies to do, is to look at where there is a topic that they're getting more requests for and that they have maintained data about.
0:30:57
Sometimes there's a topic where there is not that dataset in existence.
0:31:02
That dataset is something that we'll encourage them to prioritize for release.
0:31:09
Oftentimes, for agencies, that's also practical and, like, just logistically easier, because they're otherwise releasing that dataset to the public
0:31:20
via FOIL.
0:31:21
So instead of fielding dozens and dozens of FOIL requests for a dataset, they could put it up on Open Data and anyone can grab it whenever they want.
0:31:30
Much more easily.
0:31:32
Ultimately, for the prioritization and just the publication of data, we work with agencies, but it's not our data.
0:31:39
It's theirs.
0:31:40
They know it far better than we do.
0:31:42
And so we also return to them for that prioritization, just based on the other things they're working on and their internal knowledge of those systems and what that data entails.
Jennifer Gutiérrez
0:31:56
And why are some datasets removed from the portal?
0:32:04
Could you expand a little bit on when that happens?
0:32:07
Do you announce it?
0:32:09
And do you believe that those datasets that are removed should still be available?
Zachary Feder
0:32:17
So, I guess, to the question of whether we announce the datasets, you'll be... Yeah.
0:32:23
Why do you
Jennifer Gutiérrez
0:32:23
remove them?
0:32:24
Why do you remove them?
0:32:25
It's June oh, I'm sorry.
0:32:26
I thought you
Zachary Feder
0:32:26
were No.
0:32:26
No.
0:32:27
No.
0:32:27
No.
0:32:27
You'll be surprised, there is actually a dataset of datasets that have been removed.
0:32:31
So there really is a dataset for everything.
Jennifer Gutiérrez
0:32:35
That makes sense.
0:32:35
That totally
Zachary Feder
0:32:36
And with that, actually, is an explanation of why the dataset was removed.
Jennifer Gutiérrez
0:32:41
Oh, okay.
Zachary Feder
0:32:41
We have recently been focusing actually on removing a lot more datasets.
0:32:46
We used to talk a lot about the number of datasets we have.
0:32:49
And when you're first starting a public data program, it's important to add more and more datasets.
0:32:54
After a while, with the number of datasets, I think we're currently sitting somewhere around 3,500,
0:33:00
it becomes difficult for someone to confirm that they have the right dataset or to find the data they're looking for.
0:33:05
So one of the things we're doing with agencies right now, when we get a new dataset,
0:33:10
is looking at what they've published already.
0:33:12
We're looking at the totality of what they are planning to share and encouraging them as much as possible to take that data and basically share it together.
0:33:22
So let's say there's a dataset from one year and we get new data from another year. We're not gonna publish it as a new dataset anymore.
0:33:30
We'll take that and just have an ongoing dataset across different years.
0:33:34
Or if we have related programs that follow a similar schema, a similar structure, those would also be combined.
0:33:42
There are other reasons that datasets get removed.
0:33:45
Sometimes it's because the program no longer exists in the same way that it did.
0:33:51
So there were some, let's say, city efforts during the height of the COVID pandemic, let's say around social distancing, that there was data being collected on.
0:34:01
And some of that data is no longer active.
0:34:04
But we will frequently, or in almost every case, preserve these datasets as historical data.
0:34:11
Okay.
0:34:11
So changing the title to indicate this is not something that's ongoing.
0:34:14
There's a metadata element for each dataset that tells you, like, what to expect for how often it's updated, and that will also be marked as historical.
0:34:25
Again, what we're focusing on is communicating clearly what's active and what's not, and trying as much as possible to have what's available actually meet those expectations.
Jennifer Gutiérrez
0:34:36
And is that historical data still living?
0:34:39
Yep.
0:34:40
Okay.
Zachary Feder
0:34:40
Yeah.
0:34:40
You could still see it, but it just is not.
Jennifer Gutiérrez
0:34:42
Yeah.
0:34:43
Totally.
0:34:43
Okay.
0:34:44
Great.
Zachary Feder
0:34:44
Not actively updated.
0:34:45
That's great.
0:34:45
Those are the majority of reasons why we're removing datasets.