Hackathon Data Access and Usage Guide

Welcome to the hackathon! This guide will help you understand the structure of the data provided, how it is organized, and how to efficiently process it for your projects.

You can request access to the data here

Overview of the Data

The dataset provides a comprehensive view of healthcare operations, focusing on member demographics, enrollment details, healthcare services, and provider information. It captures various interconnected aspects of the healthcare system, including patient enrollment in plans, the services they use, and the providers delivering those services. This granular data enables participants to analyze patterns, optimize healthcare costs, and explore equity in service delivery. The dataset is divided into several key components, each offering unique insights into the healthcare journey.

Structure and Size

Total Data:
- Approximately 450,000 members with 37.5 million records, divided into two states: Hawaii and California.
File Size:
- Compressed ZIP: ~2 GB.
- Uncompressed Parquet Files: ~3.5 GB.
- If converted to CSV: The total size may grow to ~10 GB.
Subfolder Organization:
- Each dataset (e.g., Members, Enrollment, Services) is stored in a subfolder containing one or more Parquet files.
- Each subfolder also includes two utility scripts:
  - convert_to_csv.py: Converts all Parquet files into one CSV file.
  - truncate_rows.py: Truncates the dataset to a smaller, manageable size.

Dataset Relationships

The data consists of several interconnected datasets that provide a comprehensive picture of patient care, member information, and healthcare costs. Here’s how they are related:

1. Members and Enrollment

Key Relationship: PRIMARY_PERSON_KEY, MEMBER_ID
Each member is identified by a unique key (PRIMARY_PERSON_KEY or MEMBER_ID).
These keys link demographic information (age, gender, race, etc.) in the Members dataset to the Enrollment dataset, providing details about when members started their healthcare plan and how long they have been enrolled.

2. Members and Services

Key Relationship: PRIMARY_PERSON_KEY, MEMBER_ID
The same unique keys (PRIMARY_PERSON_KEY or MEMBER_ID) link members from the Members dataset to the healthcare services they’ve used, stored in the Services dataset. This allows tracing which member received specific medical services (e.g., doctor visits or treatments).

3. Services and Providers

Key Relationship: PROV_KEY
The Services dataset is connected to the Providers dataset using the PROV_KEY, which identifies the healthcare providers who delivered the services. Providers could be individual practitioners, clinics, or hospitals, and their information includes type, location, and other attributes.

4. Overall Relationship

The datasets are interconnected in the following way:
- Members → Enrollment → Services → Providers
These relationships allow for comprehensive analyses of:
- Who are the members receiving care, and what are their demographics?
- What services are being used, and how much do they cost?
- Where are services being delivered, and by which providers?

File Structure Example

Below is an example of how the data and scripts are organized:

Hack_Lokahi/
├── Claims_Enrollment/
│   ├── part-00000-tid-&lt;id&gt;.parquet
│   ├── part-00001-tid-&lt;id&gt;.parquet
│   ├── ...
│   ├── convert_to_csv.py
│   ├── truncate_rows.py
├── Claims_Member/
│   ├── part-00000-tid-&lt;id&gt;.parquet
│   ├── convert_to_csv.py
│   ├── truncate_rows.py
├── Claims_Provider/
│   ├── part-00000-tid-&lt;id&gt;.parquet
│   ├── convert_to_csv.py
│   ├── truncate_rows.py
├── Claims_Services/
│   ├── part-00000-tid-&lt;id&gt;.parquet
│   ├── part-00000-tid-&lt;id&gt;.parquet
│   ├── part-00001-tid-&lt;id&gt;.parquet
│   ├── ...
│   ├── convert_to_csv.py
│   ├── truncate_rows.py

How to Work with the Data

1. Parquet Files

We recommend working directly with the Parquet files, as they are optimized for storage and processing. Here’s an example of how to load and explore a Parquet file:

import pandas as pd

# Load a single Parquet file
df = pd.read_parquet('path_to_your_file.parquet')

# Display the first few rows
print(df.head())

# Perform basic operations
print(df.info())

2. Convert Parquet to CSV

If you prefer working with CSV files, you can use the convert_to_csv.py script included in each subfolder.

Steps:

Navigate to the Hack_lokahi folder.
Adjust the path if needed.
Run the script:
```
python convert_to_csv.py
```
The script will:
- Combine all Parquet files in the subfolder into one CSV file.
- Save the CSV file in the same folder.

3. Truncate Data for Smaller Files

If the data is too large for your use case, you can truncate it using the truncate_rows.py script.

Steps:

Open the script and set the desired number of rows (N_ROWS).
Run the script in the desired subfolder (e.g., Claims_Services):
```
python truncate_rows.py
```
The script will:
- Combine all Parquet files.
- Create a truncated version (both Parquet and CSV formats) in a new subfolder (e.g., truncated_Claims_Services).

Dataset Details

Below is a detailed description of each dataset. Replace the placeholders below with the specific tables for each dataset.

Member Enrollment

Column Index	Column Name	Description and Codes	Format
1	`PRIMARY_PERSON_KEY`	Unique person indicator	String
2	`MEMBER_ID`	Anonymized PII token	String
3	`MEMBER_MONTH_START_DATE`	Indicates the first of the month of the enrollment record	Date
4	`YEARMO`	Enrollment month in the format YYYYMM	String
5	`MEM_AGE`	The member's age	String
6	`RELATION`	Indicates member's relationship to subscriber (e.g., Subscriber, Spouse, Dependent)	String
7	`MEM_MSA_NAME`	Member residence Metropolitan Statistical Area (MSA)	String
8	`PAYER_LOB`	Line of Business: COMMERCIAL, MEDICARE, MEDICAID, or MEDICARE_SUPPLEMENT	String
9	`PAYER_TYPE`	Type of enrollment plan. See details in the 'Payer Type Detail' tab	String
10	`PROD_TYPE`	Coverage type: "Medical", "Dental", "Vision", "RX", or other contributor-defined values	String
11	`QTY_MM_MD`	Number of months of member coverage for medical services	Float
12	`QTY_MM_RX`	Number of months of member coverage for prescription drug services	Float
13	`QTY_MM_DN`	Number of months of member coverage for dental services	Float
14	`QTY_MM_VS`	Number of months of member coverage for vision services	Float
15	`MEM_STAT`	Employment benefit status at the time of service (ACTIVE, COBRA, RETIREE)	String
16	`PRIMARY_CHRONIC_CONDITION_ROLLUP_ID`	Chronic Disease Group ID	Integer
17	`PRIMARY_CHRONIC_CONDITION_ROLLUP_DESC`	Chronic Disease Group Description	String

Members

Column Index	Column Name	Description and Codes	Format
1	`PRIMARY_PERSON_KEY`	Unique person indicator	String
2	`MEMBER_ID`	Anonymized PII token	String
3	`MEM_GENDER`	Member's gender (F, M, or U)	String
4	`MEM_RACE`	Race of the member (1 = Asian, 2 = Black, 3 = Caucasian, 4 = Other/Unknown)	String
5	`MEM_ETHNICITY`	Ethnicity (1 = Hispanic, 2 = Not Hispanic, 3 = Unknown)	String
6	`MEM_ZIP3`	First three digits of the member's residence ZIP code	String
7	`MEM_MSA_NAME`	Member residence Metropolitan Statistical Area (MSA)	String
8	`MEM_STATE`	State of residence as defined in the source eligibility record	String

Services Care Cost

Column Index	Column Name	Description and Codes	Format
1	`PRIMARY_PERSON_KEY`	Unique person indicator	String
2	`MEMBER_ID`	Anonymized PII token, where PII is available	String
3	`CLAIM_ID_KEY`	Unique internal claim identifier, unlinkable to any external claim record source	Decimal(38,0)
4	`SERVICE_LINE`	Unique number within a claim identifying a unique service line item rendered	Integer
5	`SV_STAT`	Status of claim: P=Paid, D=Denied, V=Voided (line never paid), O=Open/Pending, R=Reversed, E=Encounter (statistical claim from cap contract)	String
7	`BILL_PROV_KEY`	Internal identifier for the billing entity	String
8	`REF_PROV_KEY`	Internal identifier for the referring, ordering, or prescribing entity, if available	String
9	`ATT_PROV_KEY`	Internal identifier for the servicing/attending entity, if available	String
10	`YEARMO`	Service month in the format YYYYMM	String
11	`FROM_DATE`	The first detail date of service as supplied on detail lines of a claim	Date
12	`TO_DATE`	The last detail date of service as supplied on detail lines of a claim	Date
13	`PAID_DATE`	Date the claim line is considered paid by the plan for general ledger purposes	Date
14	`ADM_DATE`	Admission date at the day level	Date
15	`DIS_DATE`	Discharge date at the day level	Date
16	`AGE_ON_DOS`	The age of the member as of the date of service of the claim line	String
17	`RELATION`	Indicates the member's relationship to the subscriber (Subscriber, Spouse, Dependent)	String
18	`PAYER_LOB`	Line of Business: COMMERCIAL, MEDICARE, MEDICAID, MEDICARE_SUPPLEMENT	String
19	`PAYER_TYPE`	Type of enrollment plan member is covered under. See values in the 'Payer Type Detail' tab	String
20	`MEM_STAT`	Employment benefit status of the member at the time of service (ACTIVE, COBRA, RETIREE)	String
21	`CLAIM_IN_NETWORK`	Indicates if claim was paid as in-network or out-of-network	String
22	`SERVICE_SETTING`	Service setting based on a proprietary methodology (e.g., Inpatient, Outpatient, Professional, Rx, Other)	String
23	`FORM_TYPE`	Type of claim: U=UB, H=CMS1500, D=Prescription Drug, A=Dental, V=Vision, L=Lab	String
24	`UB_BILL_TYPE`	Industry standard codes	String
25	`POS`	Industry standard place of service code	String
26	`MS_DRG`	MS-DRG code assigned by the MS-DRG grouping software	String
27	`REV_CODE`	Revenue code and description	String
28	`PROC_CODE`	Procedure code and name	String
29	`CPT_MOD_1`	Procedure code modifier 1	String
30	`CPT_MOD_2`	Procedure code modifier 2	String
31	`CPT_CCS`	Procedure Clinical Classification Software categories description	String
32	`CPT_CCS_LABEL`	Procedure Clinical Classification Software categories description	String
33	`ADM_SRC`	Industry Standard Admit Source	String
34	`ADM_TYPE`	Industry Standard Admit Type	String
35	`DIS_STAT`	UB Discharge Status: Industry standard codes	String
36	`ICD_DIAG_ADMIT`	Admitting ICD diagnosis code	String
37-66	`ICD_DIAG_01 - ICD_DIAG_30`	ICD diagnosis (1 Primary Diagnosis, 2 - 30 Secondary). Includes all relevant codes.	String
67-96	`ICD_DIAG_01_POA - ICD_DIAG_30_POA`	ICD diagnosis present at time of admission: Y=Yes, N=No, U=Unknown, W=Undetermined, 1=Exempt from POA reporting	String
97-126	`ICD_PROC_01 - ICD_PROC_30`	ICD Surgical Procedure Codes	String
127	`DIAG_CCS_1_LABEL`	Multi-level clinical category description from AHRQ Clinical Classifications Software (CCS) for the primary diagnosis	String
128	`DIAG_CCS_2_LABEL`	Multi-level 2 clinical category description from AHRQ Clinical Classifications Software (CCS)	String
129	`DIAG_CCS_3_LABEL`	Multi-level 3 clinical category description from AHRQ Clinical Classifications Software (CCS)	String
130	`NDC_CODE`	National Drug Code	String
131	`RX_DAYS_SUPPLY`	Prescription days' supply	Float
132	`RX_REFILLS`	Number of prescription refills	Smallint
133	`RX_FILL_SRC`	Drug dispense source: R=Retail, M=Mail, U=Unknown	String
134	`RX_DRUG_COST`	Cost of the drug in dollars	Float
135	`RX_INGR_COST`	Cost of the ingredients in the drug	Float
136	`RX_QTY_DISPENSED`	Quantity of drug dispensed	Decimal(18,2)
137	`RX_DISP_FEE`	Prescription dispensing fee	Decimal(18,2)
138	`RX_FORM`	Indicates if drug dispensed was formulary: 1=Formulary, 0=Non-formulary	Integer
139	`AMT_BILLED`	Charge amount (billed) is the amount submitted by the service provider	Decimal(18,2)
140	`AMT_ALLOWED`	Negotiated allowed amount after discounts	Decimal(18,2)
141	`AMT_COB`	Amount paid from another payer (e.g., auto claim, workers' comp)	Decimal(18,2)
142	`AMT_COPAY`	Copay amount the individual is responsible for	Decimal(18,2)
143	`AMT_DEDUCT`	Deductible amount the member pays to the provider	Decimal(18,2)
144	`AMT_COINS`	Coinsurance amount the individual is responsible for	Decimal(18,2)
145	`AMT_PAID`	Amount paid to the provider by the plan	Decimal(18,2)
146	`AMT_DISALLOWED`	Amount not allowed to be paid to the provider	Decimal(18,2)
147	`SV_UNITS`	Number of units dispensed or administered	Decimal(12,2)
148	`DIAGNOSTIC_CONDITION_CATEGORY_ID`	Diagnostic condition category	String
149	`DIAGNOSTIC_CONDITION_CATEGORY_DESC`	Diagnostic condition category description	String

Providers

Column Index	Column Name	Description and Codes	Format
1	`PROV_KEY`	Unique provider ID	String
2	`PROV_TYPE`	Provider type (Practitioner, Organization)	String
3	`PROV_NPI_ORG`	National Provider Identifier (NPI)	String
4	`PROV_CLINIC_STATE`	State of the provider's clinic	String
5	`PROV_CLINIC_ZIP`	ZIP code of the provider's clinic	String
6	`PROV_TAXONOMY`	Provider specialty taxonomy code	String

Key Recommendations

Use Parquet Files: These are efficient for both storage and processing.
Mind the Size:
- The uncompressed data is ~2GB, but converting to CSV can increase it to ~10 GB.
- Use the truncation script for smaller datasets.
Scripts Are Pre-Included:
- Each subfolder already includes convert_to_csv.py and truncate_rows.py.

Example Potential Applications

1. AI-Powered Healthcare Cost Analyzer

Objective: Build an interactive tool to analyze and visualize healthcare costs for patients and providers.
Approach:
- Use the Services Care Cost dataset to create a dashboard that shows:
  - Cost trends by service type (PROC_CODE, SERVICE_SETTING).
  - Differences between in-network and out-of-network claims (CLAIM_IN_NETWORK).
  - Visualizations of average costs by provider (PROV_KEY).
- Include an AI model to predict future costs for a patient based on past claims.
Technologies:
- Python (Streamlit for the interface).
- Machine learning (time series forecasting with ARIMA or Prophet).
Impact: Provides actionable insights into healthcare spending patterns for patients and insurers.

2. Claim Denial Prediction System

Objective: Create a machine learning model to predict claim denial and suggest ways to prevent it.
Approach:
- Use the SV_STAT column in the Services Care Cost dataset to train a classification model (e.g., predict D for denied claims).
- Analyze the reasons for denials by correlating them with SERVICE_SETTING, FORM_TYPE, and POS.
- Add a feature to flag high-risk claims for further review.
Technologies:
- Scikit-learn or XGBoost for model training.
- Flask or FastAPI to serve the predictions.
Impact: Helps insurers and providers reduce claim rejections, saving time and money.

3. Provider Performance Dashboard

Objective: Rank providers based on cost-effectiveness and service outcomes.
Approach:
- Use Services Care Cost and Providers data to calculate metrics like:
  - Average cost per claim.
  - Service setting distribution (e.g., inpatient vs. outpatient).
  - Claim approval rates (SV_STAT).
- Visualize rankings on an interactive map using provider ZIP codes (PROV_CLINIC_ZIP).
Technologies:
- Tableau or Python (Dash/Streamlit).
- Geospatial libraries (folium or Plotly Maps).
Impact: Enables users to make informed choices when selecting healthcare providers.

4. AI-Powered Personalized Health Plan Selector

Objective: Recommend the best health insurance plan for users based on their healthcare usage history.
Approach:
- Analyze historical claims data (Services Care Cost) and demographic information (Members) to determine typical service needs.
- Use clustering to group users with similar usage patterns.
- Recommend plans (PAYER_TYPE) based on predicted costs for medical, dental, vision, and prescription services.
Technologies:
- K-Means for clustering.
- Streamlit for a recommendation interface.
Impact: Makes healthcare more accessible by simplifying the decision-making process.

5. Fraud Detection in Claims

Objective: Build a model to detect suspicious claims and prevent fraud.
Approach:
- Identify anomalies in AMT_BILLED, PROC_CODE, and SERVICE_LINE.
- Use unsupervised learning (e.g., autoencoders or isolation forests) to detect irregular patterns in claims.
- Highlight unusual cases in a dashboard for further investigation.
Technologies:
- PyTorch/TensorFlow for anomaly detection.
- Dash/Plotly for visualization.
Impact: Helps insurers save money by flagging fraudulent or inflated claims.

6. Health Equity Explorer

Objective: Build a tool to identify disparities in healthcare access and outcomes.
Approach:
- Use Members data (race, ethnicity, ZIP code) and claims data (Services Care Cost) to analyze:
  - Differences in service usage.
  - Costs and outcomes across demographic groups.
  - Geographic disparities in access (MEM_STATE, PROV_CLINIC_STATE).
- Visualize insights in an interactive dashboard.
Technologies:
- Python for analysis (Pandas, Matplotlib).
- Dash for interactive visualizations.
Impact: Informs policymakers and organizations about gaps in healthcare access.

7. Prescription Cost Optimizer

Objective: Recommend cost-effective prescription options for patients.
Approach:
- Use RX_DRUG_COST, RX_QTY_DISPENSED, and RX_FORM to find trends in drug pricing.
- Suggest alternatives (e.g., generics vs. brand-name drugs) based on cost and formulary status (RX_FORM).
- Include visualizations of drug costs by provider or pharmacy.
Technologies:
- Python (Streamlit for UI).
- Machine learning for cost prediction and recommendations.
Impact: Reduces prescription costs for patients and insurers.

8. Predictive Chronic Disease Risk Tool

Objective: Predict a member’s risk of developing chronic diseases based on past claims and demographic data.
Approach:
- Train a model using PRIMARY_CHRONIC_CONDITION_ROLLUP_DESC, service history (PROC_CODE), and demographic data (MEM_AGE, MEM_RACE).
- Provide risk scores and recommend preventive measures.
Technologies:
- Gradient Boosted Trees (e.g., XGBoost, LightGBM).
- A simple web interface using Flask or Streamlit.
Impact: Improves health outcomes by enabling early interventions.

9. AI-Driven Service Line Prediction

Objective: Predict the next likely service line a member might use based on their history.
Approach:
- Use SERVICE_LINE, PROC_CODE, and MEM_AGE to predict future services.
- Train a sequence model (e.g., RNN, LSTM) to analyze service sequences.
Technologies:
- PyTorch/TensorFlow for sequence modeling.
- Flask or a chatbot integration for deployment.
Impact: Enhances patient care by anticipating future needs.

10. Interactive Hackathon Data Explorer

Objective: Build a user-friendly tool to explore the datasets visually.
Approach:
- Provide filtering options for claims (SERVICE_SETTING, SV_STAT).
- Include geographic visualizations (e.g., state-level provider performance).
- Allow users to generate custom visualizations for their analyses.
Technologies:
- Dash/Streamlit for the interface.
- Plotly for interactive graphs.
Impact: Empowers hackathon participants to explore and extract insights quickly.

Final Note

We’re thrilled to have you as part of this hackathon, and we can’t wait to see the innovative solutions you come up with! Whether you’re building predictive models, uncovering patterns, or exploring healthcare equity, this dataset offers countless opportunities to make a meaningful impact.

Remember, collaboration and creativity are key, so don’t hesitate to reach out to fellow participants or mentors if you need guidance. Most importantly, have fun, think big, and enjoy the process of bringing your ideas to life.

Good luck, and let the hacking begin!

Edit on Github