Handling Nested JSON Structures for Machine Learning Data Preparation

Created: 2025-10-17

json pandas data-processing machine-learning python

I'm working on a project where I'm preparing a dataset for a machine learning model, and I've run into challenges parsing nested JSON structures. The API I'm pulling data from returns a complex structure with multiple levels of nesting, which complicates data extraction. My current JSON response looks something like this:

```json
{
  "data": {
    "users": [
      {
        "id": 1,
        "name": "Alice",
        "address": { "city": "New York", "zip": "10001" },
        "purchases": [
          { "item": "Laptop", "price": 1200 },
          { "item": "Mouse", "price": 50 }
        ]
      },
      {
        "id": 2,
        "name": "Bob",
        "address": { "city": "Los Angeles", "zip": "90001" },
        "purchases": [
          { "item": "Keyboard", "price": 100 }
        ]
      }
    ]
  }
}
```

To prepare my data, I need to flatten this structure for use in a Pandas DataFrame. I've tried using `json_normalize`, but I'm still struggling with how best to extract the nested `address` and `purchases` details. Here's a snippet of what I attempted:

```python
import pandas as pd

json_data = { ... }  # JSON data as shown above

# Attempt #1: flatten the user records with json_normalize.
# record_path walks data -> users, so each user becomes one row;
# nested dicts such as address are expanded into dotted columns.
normalized_data = pd.json_normalize(json_data, record_path=['data', 'users'])
```

This produced a DataFrame, but it didn't flatten the purchases: `purchases` stays a single column of lists, which left me with a lot of manual work. Next, I tried unpacking the purchases separately:

```python
# Attempt #2: build one row per purchase by hand
purchases = []
for user in json_data['data']['users']:
    for purchase in user['purchases']:
        purchases.append({
            'user_id': user['id'],  # key for joining back to the user table
            'item': purchase['item'],
            'price': purchase['price'],
            'city': user['address']['city'],
        })

purchases_df = pd.DataFrame(purchases)
```

Now I've successfully extracted the user purchases, but I still need to merge this back into a complete DataFrame that includes user information and addresses.
Any tips on how to efficiently join these two DataFrames without running into index alignment issues? I'd also greatly appreciate any best practices for handling nested JSON structures like this in Python. I'm working with Python 3.9 and Pandas 1.3.3, and I'm coming from a different tech stack and still learning Python. Thanks in advance!
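
For reference, here is a minimal sketch of two ways this could be approached, assuming the response always has the shape shown above. The names `flat_df`, `users_df`, and `merged_df` are illustrative and not part of the original attempts. Option A lets `json_normalize` do all the flattening in one call. Option B keeps two tables and joins them on an explicit key column, so the row index does not affect the result.

```python
import pandas as pd

# Sample payload copied from the question; a real API response is assumed
# to have the same shape.
json_data = {
    "data": {
        "users": [
            {
                "id": 1,
                "name": "Alice",
                "address": {"city": "New York", "zip": "10001"},
                "purchases": [
                    {"item": "Laptop", "price": 1200},
                    {"item": "Mouse", "price": 50},
                ],
            },
            {
                "id": 2,
                "name": "Bob",
                "address": {"city": "Los Angeles", "zip": "90001"},
                "purchases": [{"item": "Keyboard", "price": 100}],
            },
        ]
    }
}

users = json_data["data"]["users"]

# Option A: one row per purchase, with user-level fields repeated via `meta`.
# A nested meta path such as ["address", "city"] becomes the column "address.city".
flat_df = pd.json_normalize(
    users,
    record_path="purchases",
    meta=["id", "name", ["address", "city"], ["address", "zip"]],
)

# Option B: a user table and a purchase table, joined on a key column.
users_df = pd.json_normalize(users).drop(columns="purchases")
purchases_df = pd.json_normalize(users, record_path="purchases", meta=["id"])
purchases_df = purchases_df.rename(columns={"id": "user_id"})
# Cast the key so both sides of the join share the same integer dtype.
purchases_df["user_id"] = purchases_df["user_id"].astype("int64")

merged_df = purchases_df.merge(
    users_df,
    left_on="user_id",
    right_on="id",
    how="left",
    validate="many_to_one",  # raises if a user id is duplicated in users_df
).drop(columns="id")

print(flat_df)
print(merged_df)
```

Because `merge` matches rows on the values in the key columns, the index of either DataFrame does not matter here. The `validate` argument is optional and only acts as a guard against unexpected duplicate user ids.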