Hackathon Data Access and Usage Guide

Welcome to the hackathon! This guide will help you understand the structure of the data provided, how it is organized, and how to efficiently process it for your projects.

You can request access to the data here


Overview of the Data

The dataset provides a comprehensive view of healthcare operations, focusing on member demographics, enrollment details, healthcare services, and provider information. It captures various interconnected aspects of the healthcare system, including patient enrollment in plans, the services they use, and the providers delivering those services. This granular data enables participants to analyze patterns, optimize healthcare costs, and explore equity in service delivery. The dataset is divided into several key components, each offering unique insights into the healthcare journey.


Structure and Size

  • Total Data:
    • Approximately 450,000 members with 37.5 million records, divided into two states: Hawaii and California.
  • File Size:
    • Compressed ZIP: ~2 GB.
    • Uncompressed Parquet Files: ~3.5 GB.
    • If converted to CSV: The total size may grow to ~10 GB.
  • Subfolder Organization:
    • Each dataset (e.g., Members, Enrollment, Services) is stored in a subfolder containing one or more Parquet files.
    • Each subfolder also includes two utility scripts:
      • convert_to_csv.py: Converts all Parquet files into one CSV file.
      • truncate_rows.py: Truncates the dataset to a smaller, manageable size.

Dataset Relationships

The data consists of several interconnected datasets that provide a comprehensive picture of patient care, member information, and healthcare costs. Here’s how they are related:

1. Members and Enrollment

  • Key Relationship: PRIMARY_PERSON_KEY, MEMBER_ID
  • Each member is identified by a unique key (PRIMARY_PERSON_KEY or MEMBER_ID).
  • These keys link demographic information (age, gender, race, etc.) in the Members dataset to the Enrollment dataset, providing details about when members started their healthcare plan and how long they have been enrolled.

2. Members and Services

  • Key Relationship: PRIMARY_PERSON_KEY, MEMBER_ID
  • The same unique keys (PRIMARY_PERSON_KEY or MEMBER_ID) link members from the Members dataset to the healthcare services they’ve used, stored in the Services dataset. This allows tracing which member received specific medical services (e.g., doctor visits or treatments).

3. Services and Providers

  • Key Relationship: PROV_KEY
  • The Services dataset is connected to the Providers dataset using the PROV_KEY, which identifies the healthcare providers who delivered the services. Providers could be individual practitioners, clinics, or hospitals, and their information includes type, location, and other attributes.

4. Overall Relationship

  • The datasets are interconnected in the following way:
    • MembersEnrollmentServicesProviders
  • These relationships allow for comprehensive analyses of:
    • Who are the members receiving care, and what are their demographics?
    • What services are being used, and how much do they cost?
    • Where are services being delivered, and by which providers?

File Structure Example

Below is an example of how the data and scripts are organized:

Hack_Lokahi/
├── Claims_Enrollment/
│   ├── part-00000-tid-<id>.parquet
│   ├── part-00001-tid-<id>.parquet
│   ├── ...
│   ├── convert_to_csv.py
│   ├── truncate_rows.py
├── Claims_Member/
│   ├── part-00000-tid-<id>.parquet
│   ├── convert_to_csv.py
│   ├── truncate_rows.py
├── Claims_Provider/
│   ├── part-00000-tid-<id>.parquet
│   ├── convert_to_csv.py
│   ├── truncate_rows.py
├── Claims_Services/
│   ├── part-00000-tid-<id>.parquet
│   ├── part-00000-tid-<id>.parquet
│   ├── part-00001-tid-<id>.parquet
│   ├── ...
│   ├── convert_to_csv.py
│   ├── truncate_rows.py

How to Work with the Data

1. Parquet Files

We recommend working directly with the Parquet files, as they are optimized for storage and processing. Here’s an example of how to load and explore a Parquet file:

import pandas as pd

# Load a single Parquet file
df = pd.read_parquet('path_to_your_file.parquet')

# Display the first few rows
print(df.head())

# Perform basic operations
print(df.info())

2. Convert Parquet to CSV

If you prefer working with CSV files, you can use the convert_to_csv.py script included in each subfolder.

Steps:

  1. Navigate to the Hack_lokahi folder.
  2. Adjust the path if needed.
  3. Run the script:
    python convert_to_csv.py
    
  4. The script will:
    • Combine all Parquet files in the subfolder into one CSV file.
    • Save the CSV file in the same folder.

3. Truncate Data for Smaller Files

If the data is too large for your use case, you can truncate it using the truncate_rows.py script.

Steps:

  1. Open the script and set the desired number of rows (N_ROWS).
  2. Run the script in the desired subfolder (e.g., Claims_Services):
    python truncate_rows.py
    
  3. The script will:
    • Combine all Parquet files.
    • Create a truncated version (both Parquet and CSV formats) in a new subfolder (e.g., truncated_Claims_Services).

Dataset Details

Below is a detailed description of each dataset. Replace the placeholders below with the specific tables for each dataset.


Member Enrollment

Column IndexColumn NameDescription and CodesFormat
1PRIMARY_PERSON_KEYUnique person indicatorString
2MEMBER_IDAnonymized PII tokenString
3MEMBER_MONTH_START_DATEIndicates the first of the month of the enrollment recordDate
4YEARMOEnrollment month in the format YYYYMMString
5MEM_AGEThe member's ageString
6RELATIONIndicates member's relationship to subscriber (e.g., Subscriber, Spouse, Dependent)String
7MEM_MSA_NAMEMember residence Metropolitan Statistical Area (MSA)String
8PAYER_LOBLine of Business: COMMERCIAL, MEDICARE, MEDICAID, or MEDICARE_SUPPLEMENTString
9PAYER_TYPEType of enrollment plan. See details in the 'Payer Type Detail' tabString
10PROD_TYPECoverage type: "Medical", "Dental", "Vision", "RX", or other contributor-defined valuesString
11QTY_MM_MDNumber of months of member coverage for medical servicesFloat
12QTY_MM_RXNumber of months of member coverage for prescription drug servicesFloat
13QTY_MM_DNNumber of months of member coverage for dental servicesFloat
14QTY_MM_VSNumber of months of member coverage for vision servicesFloat
15MEM_STATEmployment benefit status at the time of service (ACTIVE, COBRA, RETIREE)String
16PRIMARY_CHRONIC_CONDITION_ROLLUP_IDChronic Disease Group IDInteger
17PRIMARY_CHRONIC_CONDITION_ROLLUP_DESCChronic Disease Group DescriptionString

Members

Column IndexColumn NameDescription and CodesFormat
1PRIMARY_PERSON_KEYUnique person indicatorString
2MEMBER_IDAnonymized PII tokenString
3MEM_GENDERMember's gender (F, M, or U)String
4MEM_RACERace of the member (1 = Asian, 2 = Black, 3 = Caucasian, 4 = Other/Unknown)String
5MEM_ETHNICITYEthnicity (1 = Hispanic, 2 = Not Hispanic, 3 = Unknown)String
6MEM_ZIP3First three digits of the member's residence ZIP codeString
7MEM_MSA_NAMEMember residence Metropolitan Statistical Area (MSA)String
8MEM_STATEState of residence as defined in the source eligibility recordString

Services Care Cost

Column IndexColumn NameDescription and CodesFormat
1PRIMARY_PERSON_KEYUnique person indicatorString
2MEMBER_IDAnonymized PII token, where PII is availableString
3CLAIM_ID_KEYUnique internal claim identifier, unlinkable to any external claim record sourceDecimal(38,0)
4SERVICE_LINEUnique number within a claim identifying a unique service line item renderedInteger
5SV_STATStatus of claim: P=Paid, D=Denied, V=Voided (line never paid), O=Open/Pending, R=Reversed, E=Encounter (statistical claim from cap contract)String
7BILL_PROV_KEYInternal identifier for the billing entityString
8REF_PROV_KEYInternal identifier for the referring, ordering, or prescribing entity, if availableString
9ATT_PROV_KEYInternal identifier for the servicing/attending entity, if availableString
10YEARMOService month in the format YYYYMMString
11FROM_DATEThe first detail date of service as supplied on detail lines of a claimDate
12TO_DATEThe last detail date of service as supplied on detail lines of a claimDate
13PAID_DATEDate the claim line is considered paid by the plan for general ledger purposesDate
14ADM_DATEAdmission date at the day levelDate
15DIS_DATEDischarge date at the day levelDate
16AGE_ON_DOSThe age of the member as of the date of service of the claim lineString
17RELATIONIndicates the member's relationship to the subscriber (Subscriber, Spouse, Dependent)String
18PAYER_LOBLine of Business: COMMERCIAL, MEDICARE, MEDICAID, MEDICARE_SUPPLEMENTString
19PAYER_TYPEType of enrollment plan member is covered under. See values in the 'Payer Type Detail' tabString
20MEM_STATEmployment benefit status of the member at the time of service (ACTIVE, COBRA, RETIREE)String
21CLAIM_IN_NETWORKIndicates if claim was paid as in-network or out-of-networkString
22SERVICE_SETTINGService setting based on a proprietary methodology (e.g., Inpatient, Outpatient, Professional, Rx, Other)String
23FORM_TYPEType of claim: U=UB, H=CMS1500, D=Prescription Drug, A=Dental, V=Vision, L=LabString
24UB_BILL_TYPEIndustry standard codesString
25POSIndustry standard place of service codeString
26MS_DRGMS-DRG code assigned by the MS-DRG grouping softwareString
27REV_CODERevenue code and descriptionString
28PROC_CODEProcedure code and nameString
29CPT_MOD_1Procedure code modifier 1String
30CPT_MOD_2Procedure code modifier 2String
31CPT_CCSProcedure Clinical Classification Software categories descriptionString
32CPT_CCS_LABELProcedure Clinical Classification Software categories descriptionString
33ADM_SRCIndustry Standard Admit SourceString
34ADM_TYPEIndustry Standard Admit TypeString
35DIS_STATUB Discharge Status: Industry standard codesString
36ICD_DIAG_ADMITAdmitting ICD diagnosis codeString
37-66ICD_DIAG_01 - ICD_DIAG_30ICD diagnosis (1 Primary Diagnosis, 2 - 30 Secondary). Includes all relevant codes.String
67-96ICD_DIAG_01_POA - ICD_DIAG_30_POAICD diagnosis present at time of admission: Y=Yes, N=No, U=Unknown, W=Undetermined, 1=Exempt from POA reportingString
97-126ICD_PROC_01 - ICD_PROC_30ICD Surgical Procedure CodesString
127DIAG_CCS_1_LABELMulti-level clinical category description from AHRQ Clinical Classifications Software (CCS) for the primary diagnosisString
128DIAG_CCS_2_LABELMulti-level 2 clinical category description from AHRQ Clinical Classifications Software (CCS)String
129DIAG_CCS_3_LABELMulti-level 3 clinical category description from AHRQ Clinical Classifications Software (CCS)String
130NDC_CODENational Drug CodeString
131RX_DAYS_SUPPLYPrescription days' supplyFloat
132RX_REFILLSNumber of prescription refillsSmallint
133RX_FILL_SRCDrug dispense source: R=Retail, M=Mail, U=UnknownString
134RX_DRUG_COSTCost of the drug in dollarsFloat
135RX_INGR_COSTCost of the ingredients in the drugFloat
136RX_QTY_DISPENSEDQuantity of drug dispensedDecimal(18,2)
137RX_DISP_FEEPrescription dispensing feeDecimal(18,2)
138RX_FORMIndicates if drug dispensed was formulary: 1=Formulary, 0=Non-formularyInteger
139AMT_BILLEDCharge amount (billed) is the amount submitted by the service providerDecimal(18,2)
140AMT_ALLOWEDNegotiated allowed amount after discountsDecimal(18,2)
141AMT_COBAmount paid from another payer (e.g., auto claim, workers' comp)Decimal(18,2)
142AMT_COPAYCopay amount the individual is responsible forDecimal(18,2)
143AMT_DEDUCTDeductible amount the member pays to the providerDecimal(18,2)
144AMT_COINSCoinsurance amount the individual is responsible forDecimal(18,2)
145AMT_PAIDAmount paid to the provider by the planDecimal(18,2)
146AMT_DISALLOWEDAmount not allowed to be paid to the providerDecimal(18,2)
147SV_UNITSNumber of units dispensed or administeredDecimal(12,2)
148DIAGNOSTIC_CONDITION_CATEGORY_IDDiagnostic condition categoryString
149DIAGNOSTIC_CONDITION_CATEGORY_DESCDiagnostic condition category descriptionString

Providers

Column IndexColumn NameDescription and CodesFormat
1PROV_KEYUnique provider IDString
2PROV_TYPEProvider type (Practitioner, Organization)String
3PROV_NPI_ORGNational Provider Identifier (NPI)String
4PROV_CLINIC_STATEState of the provider's clinicString
5PROV_CLINIC_ZIPZIP code of the provider's clinicString
6PROV_TAXONOMYProvider specialty taxonomy codeString

Key Recommendations

  1. Use Parquet Files: These are efficient for both storage and processing.
  2. Mind the Size:
    • The uncompressed data is ~2GB, but converting to CSV can increase it to ~10 GB.
    • Use the truncation script for smaller datasets.
  3. Scripts Are Pre-Included:
    • Each subfolder already includes convert_to_csv.py and truncate_rows.py.

Example Potential Applications

1. AI-Powered Healthcare Cost Analyzer

  • Objective: Build an interactive tool to analyze and visualize healthcare costs for patients and providers.
  • Approach:
    • Use the Services Care Cost dataset to create a dashboard that shows:
      • Cost trends by service type (PROC_CODE, SERVICE_SETTING).
      • Differences between in-network and out-of-network claims (CLAIM_IN_NETWORK).
      • Visualizations of average costs by provider (PROV_KEY).
    • Include an AI model to predict future costs for a patient based on past claims.
  • Technologies:
    • Python (Streamlit for the interface).
    • Machine learning (time series forecasting with ARIMA or Prophet).
  • Impact: Provides actionable insights into healthcare spending patterns for patients and insurers.

2. Claim Denial Prediction System

  • Objective: Create a machine learning model to predict claim denial and suggest ways to prevent it.
  • Approach:
    • Use the SV_STAT column in the Services Care Cost dataset to train a classification model (e.g., predict D for denied claims).
    • Analyze the reasons for denials by correlating them with SERVICE_SETTING, FORM_TYPE, and POS.
    • Add a feature to flag high-risk claims for further review.
  • Technologies:
    • Scikit-learn or XGBoost for model training.
    • Flask or FastAPI to serve the predictions.
  • Impact: Helps insurers and providers reduce claim rejections, saving time and money.

3. Provider Performance Dashboard

  • Objective: Rank providers based on cost-effectiveness and service outcomes.
  • Approach:
    • Use Services Care Cost and Providers data to calculate metrics like:
      • Average cost per claim.
      • Service setting distribution (e.g., inpatient vs. outpatient).
      • Claim approval rates (SV_STAT).
    • Visualize rankings on an interactive map using provider ZIP codes (PROV_CLINIC_ZIP).
  • Technologies:
    • Tableau or Python (Dash/Streamlit).
    • Geospatial libraries (folium or Plotly Maps).
  • Impact: Enables users to make informed choices when selecting healthcare providers.

4. AI-Powered Personalized Health Plan Selector

  • Objective: Recommend the best health insurance plan for users based on their healthcare usage history.
  • Approach:
    • Analyze historical claims data (Services Care Cost) and demographic information (Members) to determine typical service needs.
    • Use clustering to group users with similar usage patterns.
    • Recommend plans (PAYER_TYPE) based on predicted costs for medical, dental, vision, and prescription services.
  • Technologies:
    • K-Means for clustering.
    • Streamlit for a recommendation interface.
  • Impact: Makes healthcare more accessible by simplifying the decision-making process.

5. Fraud Detection in Claims

  • Objective: Build a model to detect suspicious claims and prevent fraud.
  • Approach:
    • Identify anomalies in AMT_BILLED, PROC_CODE, and SERVICE_LINE.
    • Use unsupervised learning (e.g., autoencoders or isolation forests) to detect irregular patterns in claims.
    • Highlight unusual cases in a dashboard for further investigation.
  • Technologies:
    • PyTorch/TensorFlow for anomaly detection.
    • Dash/Plotly for visualization.
  • Impact: Helps insurers save money by flagging fraudulent or inflated claims.

6. Health Equity Explorer

  • Objective: Build a tool to identify disparities in healthcare access and outcomes.
  • Approach:
    • Use Members data (race, ethnicity, ZIP code) and claims data (Services Care Cost) to analyze:
      • Differences in service usage.
      • Costs and outcomes across demographic groups.
      • Geographic disparities in access (MEM_STATE, PROV_CLINIC_STATE).
    • Visualize insights in an interactive dashboard.
  • Technologies:
    • Python for analysis (Pandas, Matplotlib).
    • Dash for interactive visualizations.
  • Impact: Informs policymakers and organizations about gaps in healthcare access.

7. Prescription Cost Optimizer

  • Objective: Recommend cost-effective prescription options for patients.
  • Approach:
    • Use RX_DRUG_COST, RX_QTY_DISPENSED, and RX_FORM to find trends in drug pricing.
    • Suggest alternatives (e.g., generics vs. brand-name drugs) based on cost and formulary status (RX_FORM).
    • Include visualizations of drug costs by provider or pharmacy.
  • Technologies:
    • Python (Streamlit for UI).
    • Machine learning for cost prediction and recommendations.
  • Impact: Reduces prescription costs for patients and insurers.

8. Predictive Chronic Disease Risk Tool

  • Objective: Predict a member’s risk of developing chronic diseases based on past claims and demographic data.
  • Approach:
    • Train a model using PRIMARY_CHRONIC_CONDITION_ROLLUP_DESC, service history (PROC_CODE), and demographic data (MEM_AGE, MEM_RACE).
    • Provide risk scores and recommend preventive measures.
  • Technologies:
    • Gradient Boosted Trees (e.g., XGBoost, LightGBM).
    • A simple web interface using Flask or Streamlit.
  • Impact: Improves health outcomes by enabling early interventions.

9. AI-Driven Service Line Prediction

  • Objective: Predict the next likely service line a member might use based on their history.
  • Approach:
    • Use SERVICE_LINE, PROC_CODE, and MEM_AGE to predict future services.
    • Train a sequence model (e.g., RNN, LSTM) to analyze service sequences.
  • Technologies:
    • PyTorch/TensorFlow for sequence modeling.
    • Flask or a chatbot integration for deployment.
  • Impact: Enhances patient care by anticipating future needs.

10. Interactive Hackathon Data Explorer

  • Objective: Build a user-friendly tool to explore the datasets visually.
  • Approach:
    • Provide filtering options for claims (SERVICE_SETTING, SV_STAT).
    • Include geographic visualizations (e.g., state-level provider performance).
    • Allow users to generate custom visualizations for their analyses.
  • Technologies:
    • Dash/Streamlit for the interface.
    • Plotly for interactive graphs.
  • Impact: Empowers hackathon participants to explore and extract insights quickly.

Final Note

We’re thrilled to have you as part of this hackathon, and we can’t wait to see the innovative solutions you come up with! Whether you’re building predictive models, uncovering patterns, or exploring healthcare equity, this dataset offers countless opportunities to make a meaningful impact.

Remember, collaboration and creativity are key, so don’t hesitate to reach out to fellow participants or mentors if you need guidance. Most importantly, have fun, think big, and enjoy the process of bringing your ideas to life.

Good luck, and let the hacking begin!