The dataset consists of 907,464 members and their associated enrollment and services data. Processing the entire dataset with Pandas and Dask on my hardware, a Dell Inspiron laptop with 16 GB of RAM and 4 cores, was not feasible: runs failed with memory-allocation and out-of-disk-space errors after a few minutes. I therefore split the member dataset into groups of 100,000 rows, ordered by Parquet file name in the original dataset, and called each group a "cohort". This produced 10 cohorts, the first 9 with 100,000 rows each and the 10th with the remaining 7,464 rows.

I then analyzed each cohort separately and generated separate HTML reports for state, race, ethnicity, gender, and age comparisons, 10 reports per cohort in total, plus a single additional report for providers by state. The analysis of each cohort takes several minutes, but every run now completes on my laptop without crashing. With all 10 × 10 + 1 = 101 reports generated from the analysis, I wrote a web UI to display them. The user can select a specific cohort by number (1 to 10), expand all 10 cohorts to see their corresponding reports, or open the providers report.
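The sketch below illustrates the cohort-splitting idea, assuming the member records live in a directory of Parquet files. The directory path, column layout, and the commented-out report step are illustrative placeholders rather than the actual pipeline code; the key point is that files are read in name order and at most 100,000 rows are held in memory per analysis pass.

```python
# Minimal sketch of splitting the member Parquet files into 100,000-row
# cohorts. Paths and the report step are hypothetical examples.
import glob
import pandas as pd

COHORT_SIZE = 100_000

def iter_cohorts(parquet_dir: str, cohort_size: int = COHORT_SIZE):
    """Yield DataFrames of at most `cohort_size` rows, reading the Parquet
    files in file-name order so cohort membership is reproducible."""
    buffer = []
    buffered_rows = 0
    for path in sorted(glob.glob(f"{parquet_dir}/*.parquet")):
        df = pd.read_parquet(path)
        buffer.append(df)
        buffered_rows += len(df)
        # Emit full cohorts as soon as enough rows have accumulated.
        while buffered_rows >= cohort_size:
            combined = pd.concat(buffer, ignore_index=True)
            yield combined.iloc[:cohort_size]
            leftover = combined.iloc[cohort_size:]
            buffer = [leftover]
            buffered_rows = len(leftover)
    if buffered_rows:
        # Final partial cohort (7,464 rows in this dataset).
        yield pd.concat(buffer, ignore_index=True)

if __name__ == "__main__":
    for i, cohort in enumerate(iter_cohorts("data/members"), start=1):
        # One analysis pass per cohort bounds peak memory by the cohort
        # size instead of the full 907,464-member dataset.
        print(f"cohort {i}: {len(cohort)} rows")
        # generate_reports(cohort, cohort_number=i)  # hypothetical report step
```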