Error result when voter file populations are located in zero population tracts #151
Comments
I've also run into this one. I excluded them from the sample I was trying to predict race for and added them back in with the state (or closest geography) race distribution for the extremely small sample of people that were living there. You should know too that the census data will not be robust in those locations because of differential privacy. There's a recent paper that looks at this a bit too: https://www.science.org/doi/10.1126/sciadv.adl2524. You'll have to make a judgement call. Understanding this, how would you have expected the package to behave? |
I am shifting people to the nearest populated tract, but not sure how I feel about that. What you are doing sounds equally defensible. I tend to think of population sorting at the sub-county level as pretty important, so I don't like jumping up a scale, but it is certainly no better or worse than what I am doing. I am aware of the differential privacy stuff; I work with individual census responses and small geographies all the time and it is a mess at these scales. As for expected behaviour... if there is no information about race that can be generated from the census tract data, then I would default to the model prediction made without the benefit of the census data (e.g. name only) and report the number of records adjudicated in this way. Not sure how reasonable this is within the workflow of calculations though. Thanks for the quick response today and yesterday. |
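The fallback described above (use the name-only prediction when the tract carries no information, and report how many records were handled that way) can be sketched on toy data. This is only an illustration under stated assumptions: `p_name`, `tract_counts`, and every number in them are hypothetical, and the code is not how the package computes its predictions.

```r
# Hypothetical inputs: name-only probabilities P(race | name) for three voters
# and the race counts of the tract each voter lives in. Not taken from wru.
races <- c("whi", "bla", "his", "asi", "oth")
p_name <- matrix(
  c(0.70, 0.10, 0.10, 0.05, 0.05,
    0.05, 0.05, 0.80, 0.05, 0.05,
    0.20, 0.60, 0.10, 0.05, 0.05),
  nrow = 3, byrow = TRUE, dimnames = list(NULL, races)
)
tract_counts <- matrix(
  c(500, 300, 100, 50, 50,   # populated tract
      0,   0,   0,  0,  0,   # zero-population tract
    200, 700,  50, 25, 25),  # populated tract
  nrow = 3, byrow = TRUE, dimnames = list(NULL, races)
)

# P(tract | race): each tract's share of a race's total count in this toy universe.
p_geo <- sweep(tract_counts, 2, colSums(tract_counts), "/")

# Unnormalized combination of the name and geography terms.
unnorm <- p_name * p_geo

# Rows where geography contributes nothing would normalize to NaN (0 / 0).
no_geo_info <- rowSums(unnorm) == 0
post <- unnorm / rowSums(unnorm)

# Fall back to the name-only probabilities for those rows and count them.
post[no_geo_info, ] <- p_name[no_geo_info, , drop = FALSE]
n_fallback <- sum(no_geo_info)
```

Reporting `n_fallback` alongside the predictions would let the analyst see how many records were adjudicated without census information.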
fBISG defaults to the national race distribution (which it uses as a prior). I don't know if that would be desirable behavior in some of these higher tract ranges. I think some of them also count military and national parks. We will have to think about this a bit. @solivella do you have an opinion/experience here? |
Equation 6 of https://www.science.org/doi/10.1126/sciadv.adc9824 implies that, in instances in which the census counts are zero (for whatever reason), the predicted probability will at least default to the name-only race probability. If the name is not in the dictionary, then the software (is expected to) default to the national race distribution, as @1beb mentioned. Can you confirm whether the |
I can't confirm explicitly as I don't see match status in the output while stepping through in debug mode. A few look like they might have one or two of their names missing, but most should have all three in the register. Not comfortable putting the example names here for all the world to see. |
Sorry, I should have been clearer, @csfowler. I wouldn't expect you to post names here. If you are not providing dictionaries of your own, you can
|
I think I can see where the expected behavior is not emerging. I can't do what you asked, since predict_race isn't finishing to give me the result. Let me walk through it. The error is triggered by NAs in the race.init run of the BISG model, where impute.missing is set to TRUE. The stop message indicates it is probably bad geography, but I have checked for that. Prior to running the model I concatenate county and tract in my voter file (vf) to create a new column, FIPS. I do the same thing with county and tract in 'cen', the census geography object I will use in my function call. summary(vf$FIPS %in% cen$FIPS) shows that all of my geographic units in the voter file can also be found in cen. Then I call the function. Stepping through that function call in debug mode, I can see that the vector returned from the internal call to predict_race (race.init) has 57 NAs in it. To be clear, this is the call to the BISG model, but it is run with impute.missing hard coded to TRUE, so that may be some lower-level functionality not working as expected. Looking into the NA values, they are all generated by tracts that have zero population in cen. From here the code triggers a stop because of the NAs in race.init. If, still in debug mode, I manually change the 57 values to '1', then the function seems to proceed as expected (still running). Hope this helps. |
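The two checks described in this comment (every voter-file geography exists in the census table, yet some of those geographies have zero population) can be sketched as follows. The data frames and the `total_pop` column are hypothetical stand-ins; the real census object used with the package is structured differently, so treat this as an illustration of the logic only.

```r
# Toy stand-ins for the voter file (vf) and the census table (cen).
# Column names and values are assumptions made for this example.
vf <- data.frame(
  voter_id = 1:5,
  county = c("001", "001", "003", "003", "005"),
  tract = c("000100", "980100", "000200", "980200", "000300"),
  stringsAsFactors = FALSE
)
cen <- data.frame(
  county = c("001", "001", "003", "003", "005"),
  tract = c("000100", "980100", "000200", "980200", "000300"),
  total_pop = c(4200, 0, 3100, 0, 2750),
  stringsAsFactors = FALSE
)

# Build the combined county + tract key on both sides.
vf$FIPS <- paste0(vf$county, vf$tract)
cen$FIPS <- paste0(cen$county, cen$tract)

# Check 1: every voter-file geography is present in the census table.
all_matched <- all(vf$FIPS %in% cen$FIPS)

# Check 2: which voters sit in tracts that the census table reports as empty?
zero_fips <- cen$FIPS[cen$total_pop == 0]
in_zero_tract <- vf$FIPS %in% zero_fips
table(in_zero_tract)
```

A geography can pass the first check and still fail the second, which is why a bad-geography filter does not catch these records.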
@csfowler this is super helpful. Thank you! Can I ask that you use
|
I get 62 TRUE on table(miss_ind) and 119 TRUE on table(!is.finite(preds$c_oth_last)); the difference is the 57 problem values. |
Thank you, @csfowler! This is exactly the information I needed to pinpoint the issue. I will have an update for you on this tomorrow (along with a PR to fix it). Please stand by. |
Thank you both. This is on its way to CRAN. For now, you can use |
Close, but not quite. The race.init vector still triggers the stop condition because the results in pred.oth are not NA but Inf (I haven't figured out why yet). When you use dplyr::coalesce() to get rid of the NA values, it does not treat Inf as a missing value, so the Inf is retained. Can I suggest a step such as the one here (https://stackoverflow.com/questions/12188509/cleaning-inf-values-from-an-r-dataframe) that converts Inf to NA first? |
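The distinction is easy to reproduce in base R. In this sketch, `pred_oth` and `fallback` are made-up values, and `ifelse(is.na(...))` stands in for the NA-filling step so that the example needs no packages; like that step, it replaces NA but lets Inf through.

```r
# Hypothetical prediction column containing a mix of problem values.
pred_oth <- c(0.12, NA, Inf, 0.30, -Inf, NaN)
fallback <- 0.05  # hypothetical replacement value, not a real prior

# is.na() flags NA and NaN but not Inf or -Inf.
is.na(pred_oth)   # FALSE TRUE FALSE FALSE FALSE TRUE

# Filling only the NA entries leaves the infinite values in place.
na_only <- ifelse(is.na(pred_oth), fallback, pred_oth)

# Converting every non-finite value to NA first makes the fill complete.
cleaned <- pred_oth
cleaned[!is.finite(cleaned)] <- NA
cleaned <- ifelse(is.na(cleaned), fallback, cleaned)
```

`is.finite()` returns FALSE for NA, NaN, Inf, and -Inf alike, so a single `!is.finite()` test covers all four cases.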
@csfowler just confirming that your original statement '...that breaks because the assigned race probability for other is NaN' is now incorrect, as you are seeing an Inf instead of an NaN. Is that right? |
Yes. At the very end of the function it was breaking because of a returned NaN, but within predict_race, and specifically in the output of the function named earlier in the thread (get_census-new? I am on my phone and can't see it), the value assigned to oth was Inf. So when you use coalesce, it accepts Inf as a numeric value. (Sent by email in reply to Santiago Olivella's message of May 29, 2024, 4:18 PM.)
|
Pretty sure I have found the problem, and it might actually be a pretty important one with broader implications. |
In the above comment I figured out how to get census_helper_new to produce the expected outcomes. The predict_race function is still failing for me because of this section of code in predict_race_new: by the end of this chunk of code some of the rows in preds are all zeroes. impute.missing doesn't do anything because it treats the 0s as numbers, so preds/rowSums(preds) produces some NaNs. The voter.file object returned from this function then has NaNs in it, which causes a failure based on NaNs in the initial race values. It seems likely to me that the first chunk above will produce all zeroes with some regularity. It is actually amazing to me that this doesn't happen more often. With some unusual last names you will get a 0% likelihood for several races coming out of the census and surname dictionaries. If the person happens to have a first name that is also unusual, but for a different racial group, then you get a row of zeroes. A proposed fix would be to have a check prior to the for loop in the if(impute.missing){
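The all-zero-row failure described in this comment can be reproduced on toy data, along with one possible guard. Everything here is an assumption made for illustration: the `preds` values, the `prior` vector (not real census figures), and the choice to substitute a prior for empty rows. It is not the package's code and not necessarily the fix that was adopted.

```r
# Hypothetical per-race scores; the second row is all zeroes.
preds <- matrix(
  c(0.2, 0.1, 0.0, 0.0, 0.1,
    0.0, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.3, 0.0, 0.1, 0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("whi", "bla", "his", "asi", "oth"))
)

# Row normalization divides 0 by 0 for the empty row, which gives NaN.
normalized <- preds / rowSums(preds)
anyNA(normalized)   # TRUE

# Zeroes are valid numbers, so a missing-value imputation step never sees them.
# One possible guard: detect all-zero rows explicitly and substitute a prior.
prior <- c(whi = 0.60, bla = 0.12, his = 0.18, asi = 0.06, oth = 0.04)  # made up
zero_rows <- rowSums(preds) == 0
preds_fixed <- preds
preds_fixed[zero_rows, ] <- matrix(
  prior, nrow = sum(zero_rows), ncol = length(prior), byrow = TRUE
)
normalized_fixed <- preds_fixed / rowSums(preds_fixed)
```

Counting `sum(zero_rows)` before substituting would also show how often this case arises in a given voter file.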
There are special tracts (often with tract id codes in the 98---- range) that denote low population areas like airports, water bodies, parks, etc (census documentation here https://www2.census.gov/geo/pdfs/partnerships/psap/G-650.pdf). People can still live here (and they do), but especially with differential privacy there is a good chance these tracts show as zero population. I am running a voter file that breaks because the assigned race probability for other is NaN. Essentially the zero value gets passed all through the other racial categories and then breaks when the pr.other value is being calculated. To be clear, these are people located within valid census geographies, so skip.bad.geos doesn't move past them. I have to believe that this happens with some regularity at the smaller geographies as well. I would hope that the non-exclusion component that gets added in the fBISG framework would handle this possibility, but it seems to break on this use case. I am running into the problem using North Carolina data from L2 which I think was used to test the package initially anyway, so maybe a solution already exists?