Hackathon Data Access and Usage Guide
Welcome to the hackathon! This guide will help you understand the structure of the data provided, how it is organized, and how to efficiently process it for your projects.
You can request access to the data here
Overview of the Data
The dataset provides a comprehensive view of healthcare operations, focusing on member demographics, enrollment details, healthcare services, and provider information. It captures various interconnected aspects of the healthcare system, including patient enrollment in plans, the services they use, and the providers delivering those services. This granular data enables participants to analyze patterns, optimize healthcare costs, and explore equity in service delivery. The dataset is divided into several key components, each offering unique insights into the healthcare journey.
Structure and Size
- Total Data:
- Approximately 450,000 members with 37.5 million records, divided into two states: Hawaii and California.
- File Size:
- Compressed ZIP: ~2 GB.
- Uncompressed Parquet Files: ~3.5 GB.
- If converted to CSV: The total size may grow to ~10 GB.
- Subfolder Organization:
- Each dataset (e.g., Members, Enrollment, Services) is stored in a subfolder containing one or more Parquet files.
- Each subfolder also includes two utility scripts:
convert_to_csv.py
: Converts all Parquet files into one CSV file.truncate_rows.py
: Truncates the dataset to a smaller, manageable size.
Dataset Relationships
The data consists of several interconnected datasets that provide a comprehensive picture of patient care, member information, and healthcare costs. Here’s how they are related:
1. Members and Enrollment
- Key Relationship:
PRIMARY_PERSON_KEY
,MEMBER_ID
- Each member is identified by a unique key (
PRIMARY_PERSON_KEY
orMEMBER_ID
). - These keys link demographic information (age, gender, race, etc.) in the
Members
dataset to theEnrollment
dataset, providing details about when members started their healthcare plan and how long they have been enrolled.
2. Members and Services
- Key Relationship:
PRIMARY_PERSON_KEY
,MEMBER_ID
- The same unique keys (
PRIMARY_PERSON_KEY
orMEMBER_ID
) link members from theMembers
dataset to the healthcare services they’ve used, stored in theServices
dataset. This allows tracing which member received specific medical services (e.g., doctor visits or treatments).
3. Services and Providers
- Key Relationship:
PROV_KEY
- The
Services
dataset is connected to theProviders
dataset using thePROV_KEY
, which identifies the healthcare providers who delivered the services. Providers could be individual practitioners, clinics, or hospitals, and their information includes type, location, and other attributes.
4. Overall Relationship
- The datasets are interconnected in the following way:
- Members → Enrollment → Services → Providers
- These relationships allow for comprehensive analyses of:
- Who are the members receiving care, and what are their demographics?
- What services are being used, and how much do they cost?
- Where are services being delivered, and by which providers?
File Structure Example
Below is an example of how the data and scripts are organized:
Hack_Lokahi/
├── Claims_Enrollment/
│ ├── part-00000-tid-<id>.parquet
│ ├── part-00001-tid-<id>.parquet
│ ├── ...
│ ├── convert_to_csv.py
│ ├── truncate_rows.py
├── Claims_Member/
│ ├── part-00000-tid-<id>.parquet
│ ├── convert_to_csv.py
│ ├── truncate_rows.py
├── Claims_Provider/
│ ├── part-00000-tid-<id>.parquet
│ ├── convert_to_csv.py
│ ├── truncate_rows.py
├── Claims_Services/
│ ├── part-00000-tid-<id>.parquet
│ ├── part-00000-tid-<id>.parquet
│ ├── part-00001-tid-<id>.parquet
│ ├── ...
│ ├── convert_to_csv.py
│ ├── truncate_rows.py
How to Work with the Data
1. Parquet Files
We recommend working directly with the Parquet files, as they are optimized for storage and processing. Here’s an example of how to load and explore a Parquet file:
import pandas as pd
# Load a single Parquet file
df = pd.read_parquet('path_to_your_file.parquet')
# Display the first few rows
print(df.head())
# Perform basic operations
print(df.info())
2. Convert Parquet to CSV
If you prefer working with CSV files, you can use the convert_to_csv.py
script included in each subfolder.
Steps:
- Navigate to the
Hack_lokahi
folder. - Adjust the path if needed.
- Run the script:
python convert_to_csv.py
- The script will:
- Combine all Parquet files in the subfolder into one CSV file.
- Save the CSV file in the same folder.
3. Truncate Data for Smaller Files
If the data is too large for your use case, you can truncate it using the truncate_rows.py
script.
Steps:
- Open the script and set the desired number of rows (
N_ROWS
). - Run the script in the desired subfolder (e.g.,
Claims_Services
):python truncate_rows.py
- The script will:
- Combine all Parquet files.
- Create a truncated version (both Parquet and CSV formats) in a new subfolder (e.g.,
truncated_Claims_Services
).
Dataset Details
Below is a detailed description of each dataset. Replace the placeholders below with the specific tables for each dataset.
Member Enrollment
Column Index | Column Name | Description and Codes | Format |
---|---|---|---|
1 | PRIMARY_PERSON_KEY | Unique person indicator | String |
2 | MEMBER_ID | Anonymized PII token | String |
3 | MEMBER_MONTH_START_DATE | Indicates the first of the month of the enrollment record | Date |
4 | YEARMO | Enrollment month in the format YYYYMM | String |
5 | MEM_AGE | The member's age | String |
6 | RELATION | Indicates member's relationship to subscriber (e.g., Subscriber, Spouse, Dependent) | String |
7 | MEM_MSA_NAME | Member residence Metropolitan Statistical Area (MSA) | String |
8 | PAYER_LOB | Line of Business: COMMERCIAL, MEDICARE, MEDICAID, or MEDICARE_SUPPLEMENT | String |
9 | PAYER_TYPE | Type of enrollment plan. See details in the 'Payer Type Detail' tab | String |
10 | PROD_TYPE | Coverage type: "Medical", "Dental", "Vision", "RX", or other contributor-defined values | String |
11 | QTY_MM_MD | Number of months of member coverage for medical services | Float |
12 | QTY_MM_RX | Number of months of member coverage for prescription drug services | Float |
13 | QTY_MM_DN | Number of months of member coverage for dental services | Float |
14 | QTY_MM_VS | Number of months of member coverage for vision services | Float |
15 | MEM_STAT | Employment benefit status at the time of service (ACTIVE, COBRA, RETIREE) | String |
16 | PRIMARY_CHRONIC_CONDITION_ROLLUP_ID | Chronic Disease Group ID | Integer |
17 | PRIMARY_CHRONIC_CONDITION_ROLLUP_DESC | Chronic Disease Group Description | String |
Members
Column Index | Column Name | Description and Codes | Format |
---|---|---|---|
1 | PRIMARY_PERSON_KEY | Unique person indicator | String |
2 | MEMBER_ID | Anonymized PII token | String |
3 | MEM_GENDER | Member's gender (F, M, or U) | String |
4 | MEM_RACE | Race of the member (1 = Asian, 2 = Black, 3 = Caucasian, 4 = Other/Unknown) | String |
5 | MEM_ETHNICITY | Ethnicity (1 = Hispanic, 2 = Not Hispanic, 3 = Unknown) | String |
6 | MEM_ZIP3 | First three digits of the member's residence ZIP code | String |
7 | MEM_MSA_NAME | Member residence Metropolitan Statistical Area (MSA) | String |
8 | MEM_STATE | State of residence as defined in the source eligibility record | String |
Services Care Cost
Column Index | Column Name | Description and Codes | Format |
---|---|---|---|
1 | PRIMARY_PERSON_KEY | Unique person indicator | String |
2 | MEMBER_ID | Anonymized PII token, where PII is available | String |
3 | CLAIM_ID_KEY | Unique internal claim identifier, unlinkable to any external claim record source | Decimal(38,0) |
4 | SERVICE_LINE | Unique number within a claim identifying a unique service line item rendered | Integer |
5 | SV_STAT | Status of claim: P=Paid, D=Denied, V=Voided (line never paid), O=Open/Pending, R=Reversed, E=Encounter (statistical claim from cap contract) | String |
7 | BILL_PROV_KEY | Internal identifier for the billing entity | String |
8 | REF_PROV_KEY | Internal identifier for the referring, ordering, or prescribing entity, if available | String |
9 | ATT_PROV_KEY | Internal identifier for the servicing/attending entity, if available | String |
10 | YEARMO | Service month in the format YYYYMM | String |
11 | FROM_DATE | The first detail date of service as supplied on detail lines of a claim | Date |
12 | TO_DATE | The last detail date of service as supplied on detail lines of a claim | Date |
13 | PAID_DATE | Date the claim line is considered paid by the plan for general ledger purposes | Date |
14 | ADM_DATE | Admission date at the day level | Date |
15 | DIS_DATE | Discharge date at the day level | Date |
16 | AGE_ON_DOS | The age of the member as of the date of service of the claim line | String |
17 | RELATION | Indicates the member's relationship to the subscriber (Subscriber, Spouse, Dependent) | String |
18 | PAYER_LOB | Line of Business: COMMERCIAL, MEDICARE, MEDICAID, MEDICARE_SUPPLEMENT | String |
19 | PAYER_TYPE | Type of enrollment plan member is covered under. See values in the 'Payer Type Detail' tab | String |
20 | MEM_STAT | Employment benefit status of the member at the time of service (ACTIVE, COBRA, RETIREE) | String |
21 | CLAIM_IN_NETWORK | Indicates if claim was paid as in-network or out-of-network | String |
22 | SERVICE_SETTING | Service setting based on a proprietary methodology (e.g., Inpatient, Outpatient, Professional, Rx, Other) | String |
23 | FORM_TYPE | Type of claim: U=UB, H=CMS1500, D=Prescription Drug, A=Dental, V=Vision, L=Lab | String |
24 | UB_BILL_TYPE | Industry standard codes | String |
25 | POS | Industry standard place of service code | String |
26 | MS_DRG | MS-DRG code assigned by the MS-DRG grouping software | String |
27 | REV_CODE | Revenue code and description | String |
28 | PROC_CODE | Procedure code and name | String |
29 | CPT_MOD_1 | Procedure code modifier 1 | String |
30 | CPT_MOD_2 | Procedure code modifier 2 | String |
31 | CPT_CCS | Procedure Clinical Classification Software categories description | String |
32 | CPT_CCS_LABEL | Procedure Clinical Classification Software categories description | String |
33 | ADM_SRC | Industry Standard Admit Source | String |
34 | ADM_TYPE | Industry Standard Admit Type | String |
35 | DIS_STAT | UB Discharge Status: Industry standard codes | String |
36 | ICD_DIAG_ADMIT | Admitting ICD diagnosis code | String |
37-66 | ICD_DIAG_01 - ICD_DIAG_30 | ICD diagnosis (1 Primary Diagnosis, 2 - 30 Secondary). Includes all relevant codes. | String |
67-96 | ICD_DIAG_01_POA - ICD_DIAG_30_POA | ICD diagnosis present at time of admission: Y=Yes, N=No, U=Unknown, W=Undetermined, 1=Exempt from POA reporting | String |
97-126 | ICD_PROC_01 - ICD_PROC_30 | ICD Surgical Procedure Codes | String |
127 | DIAG_CCS_1_LABEL | Multi-level clinical category description from AHRQ Clinical Classifications Software (CCS) for the primary diagnosis | String |
128 | DIAG_CCS_2_LABEL | Multi-level 2 clinical category description from AHRQ Clinical Classifications Software (CCS) | String |
129 | DIAG_CCS_3_LABEL | Multi-level 3 clinical category description from AHRQ Clinical Classifications Software (CCS) | String |
130 | NDC_CODE | National Drug Code | String |
131 | RX_DAYS_SUPPLY | Prescription days' supply | Float |
132 | RX_REFILLS | Number of prescription refills | Smallint |
133 | RX_FILL_SRC | Drug dispense source: R=Retail, M=Mail, U=Unknown | String |
134 | RX_DRUG_COST | Cost of the drug in dollars | Float |
135 | RX_INGR_COST | Cost of the ingredients in the drug | Float |
136 | RX_QTY_DISPENSED | Quantity of drug dispensed | Decimal(18,2) |
137 | RX_DISP_FEE | Prescription dispensing fee | Decimal(18,2) |
138 | RX_FORM | Indicates if drug dispensed was formulary: 1=Formulary, 0=Non-formulary | Integer |
139 | AMT_BILLED | Charge amount (billed) is the amount submitted by the service provider | Decimal(18,2) |
140 | AMT_ALLOWED | Negotiated allowed amount after discounts | Decimal(18,2) |
141 | AMT_COB | Amount paid from another payer (e.g., auto claim, workers' comp) | Decimal(18,2) |
142 | AMT_COPAY | Copay amount the individual is responsible for | Decimal(18,2) |
143 | AMT_DEDUCT | Deductible amount the member pays to the provider | Decimal(18,2) |
144 | AMT_COINS | Coinsurance amount the individual is responsible for | Decimal(18,2) |
145 | AMT_PAID | Amount paid to the provider by the plan | Decimal(18,2) |
146 | AMT_DISALLOWED | Amount not allowed to be paid to the provider | Decimal(18,2) |
147 | SV_UNITS | Number of units dispensed or administered | Decimal(12,2) |
148 | DIAGNOSTIC_CONDITION_CATEGORY_ID | Diagnostic condition category | String |
149 | DIAGNOSTIC_CONDITION_CATEGORY_DESC | Diagnostic condition category description | String |
Providers
Column Index | Column Name | Description and Codes | Format |
---|---|---|---|
1 | PROV_KEY | Unique provider ID | String |
2 | PROV_TYPE | Provider type (Practitioner, Organization) | String |
3 | PROV_NPI_ORG | National Provider Identifier (NPI) | String |
4 | PROV_CLINIC_STATE | State of the provider's clinic | String |
5 | PROV_CLINIC_ZIP | ZIP code of the provider's clinic | String |
6 | PROV_TAXONOMY | Provider specialty taxonomy code | String |
Key Recommendations
- Use Parquet Files: These are efficient for both storage and processing.
- Mind the Size:
- The uncompressed data is ~2GB, but converting to CSV can increase it to ~10 GB.
- Use the truncation script for smaller datasets.
- Scripts Are Pre-Included:
- Each subfolder already includes
convert_to_csv.py
andtruncate_rows.py
.
- Each subfolder already includes
Example Potential Applications
1. AI-Powered Healthcare Cost Analyzer
- Objective: Build an interactive tool to analyze and visualize healthcare costs for patients and providers.
- Approach:
- Use the
Services Care Cost
dataset to create a dashboard that shows:- Cost trends by service type (
PROC_CODE
,SERVICE_SETTING
). - Differences between in-network and out-of-network claims (
CLAIM_IN_NETWORK
). - Visualizations of average costs by provider (
PROV_KEY
).
- Cost trends by service type (
- Include an AI model to predict future costs for a patient based on past claims.
- Use the
- Technologies:
- Python (Streamlit for the interface).
- Machine learning (time series forecasting with ARIMA or Prophet).
- Impact: Provides actionable insights into healthcare spending patterns for patients and insurers.
2. Claim Denial Prediction System
- Objective: Create a machine learning model to predict claim denial and suggest ways to prevent it.
- Approach:
- Use the
SV_STAT
column in theServices Care Cost
dataset to train a classification model (e.g., predictD
for denied claims). - Analyze the reasons for denials by correlating them with
SERVICE_SETTING
,FORM_TYPE
, andPOS
. - Add a feature to flag high-risk claims for further review.
- Use the
- Technologies:
- Scikit-learn or XGBoost for model training.
- Flask or FastAPI to serve the predictions.
- Impact: Helps insurers and providers reduce claim rejections, saving time and money.
3. Provider Performance Dashboard
- Objective: Rank providers based on cost-effectiveness and service outcomes.
- Approach:
- Use
Services Care Cost
andProviders
data to calculate metrics like:- Average cost per claim.
- Service setting distribution (e.g., inpatient vs. outpatient).
- Claim approval rates (
SV_STAT
).
- Visualize rankings on an interactive map using provider ZIP codes (
PROV_CLINIC_ZIP
).
- Use
- Technologies:
- Tableau or Python (Dash/Streamlit).
- Geospatial libraries (folium or Plotly Maps).
- Impact: Enables users to make informed choices when selecting healthcare providers.
4. AI-Powered Personalized Health Plan Selector
- Objective: Recommend the best health insurance plan for users based on their healthcare usage history.
- Approach:
- Analyze historical claims data (
Services Care Cost
) and demographic information (Members
) to determine typical service needs. - Use clustering to group users with similar usage patterns.
- Recommend plans (
PAYER_TYPE
) based on predicted costs for medical, dental, vision, and prescription services.
- Analyze historical claims data (
- Technologies:
- K-Means for clustering.
- Streamlit for a recommendation interface.
- Impact: Makes healthcare more accessible by simplifying the decision-making process.
5. Fraud Detection in Claims
- Objective: Build a model to detect suspicious claims and prevent fraud.
- Approach:
- Identify anomalies in
AMT_BILLED
,PROC_CODE
, andSERVICE_LINE
. - Use unsupervised learning (e.g., autoencoders or isolation forests) to detect irregular patterns in claims.
- Highlight unusual cases in a dashboard for further investigation.
- Identify anomalies in
- Technologies:
- PyTorch/TensorFlow for anomaly detection.
- Dash/Plotly for visualization.
- Impact: Helps insurers save money by flagging fraudulent or inflated claims.
6. Health Equity Explorer
- Objective: Build a tool to identify disparities in healthcare access and outcomes.
- Approach:
- Use
Members
data (race, ethnicity, ZIP code) and claims data (Services Care Cost
) to analyze:- Differences in service usage.
- Costs and outcomes across demographic groups.
- Geographic disparities in access (
MEM_STATE
,PROV_CLINIC_STATE
).
- Visualize insights in an interactive dashboard.
- Use
- Technologies:
- Python for analysis (Pandas, Matplotlib).
- Dash for interactive visualizations.
- Impact: Informs policymakers and organizations about gaps in healthcare access.
7. Prescription Cost Optimizer
- Objective: Recommend cost-effective prescription options for patients.
- Approach:
- Use
RX_DRUG_COST
,RX_QTY_DISPENSED
, andRX_FORM
to find trends in drug pricing. - Suggest alternatives (e.g., generics vs. brand-name drugs) based on cost and formulary status (
RX_FORM
). - Include visualizations of drug costs by provider or pharmacy.
- Use
- Technologies:
- Python (Streamlit for UI).
- Machine learning for cost prediction and recommendations.
- Impact: Reduces prescription costs for patients and insurers.
8. Predictive Chronic Disease Risk Tool
- Objective: Predict a member’s risk of developing chronic diseases based on past claims and demographic data.
- Approach:
- Train a model using
PRIMARY_CHRONIC_CONDITION_ROLLUP_DESC
, service history (PROC_CODE
), and demographic data (MEM_AGE
,MEM_RACE
). - Provide risk scores and recommend preventive measures.
- Train a model using
- Technologies:
- Gradient Boosted Trees (e.g., XGBoost, LightGBM).
- A simple web interface using Flask or Streamlit.
- Impact: Improves health outcomes by enabling early interventions.
9. AI-Driven Service Line Prediction
- Objective: Predict the next likely service line a member might use based on their history.
- Approach:
- Use
SERVICE_LINE
,PROC_CODE
, andMEM_AGE
to predict future services. - Train a sequence model (e.g., RNN, LSTM) to analyze service sequences.
- Use
- Technologies:
- PyTorch/TensorFlow for sequence modeling.
- Flask or a chatbot integration for deployment.
- Impact: Enhances patient care by anticipating future needs.
10. Interactive Hackathon Data Explorer
- Objective: Build a user-friendly tool to explore the datasets visually.
- Approach:
- Provide filtering options for claims (
SERVICE_SETTING
,SV_STAT
). - Include geographic visualizations (e.g., state-level provider performance).
- Allow users to generate custom visualizations for their analyses.
- Provide filtering options for claims (
- Technologies:
- Dash/Streamlit for the interface.
- Plotly for interactive graphs.
- Impact: Empowers hackathon participants to explore and extract insights quickly.
Final Note
We’re thrilled to have you as part of this hackathon, and we can’t wait to see the innovative solutions you come up with! Whether you’re building predictive models, uncovering patterns, or exploring healthcare equity, this dataset offers countless opportunities to make a meaningful impact.
Remember, collaboration and creativity are key, so don’t hesitate to reach out to fellow participants or mentors if you need guidance. Most importantly, have fun, think big, and enjoy the process of bringing your ideas to life.
Good luck, and let the hacking begin!