CodexBloom - Programming Q&A Platform

How to optimize `lm()` for large datasets in R without running into memory issues?

👀 Views: 0 💬 Answers: 1 📅 Created: 2025-06-16
r lm biglm data.table memory-issues

Quick question that's been bugging me; I'm sure I'm missing something obvious here. I'm working with a large dataset in R (around 10 million rows) and trying to fit a linear model with `lm()`, but I keep running into memory allocation issues that crash R. The dataset has multiple predictors and is stored as a data frame.

I initially tried a standard call:

```r
model <- lm(y ~ x1 + x2 + x3, data = large_df)
```

After a few minutes I get the error: `Error: cannot allocate vector of size X Gb`.

I also tried reducing the memory load by subsetting the data:

```r
subset_df <- large_df[sample(1:nrow(large_df), size = 100000), ]
model <- lm(y ~ x1 + x2 + x3, data = subset_df)
```

This works, but I worry about losing important variability in the data.

I've read about `biglm` as a potential solution, but I'm unsure how to implement it correctly for my case (I've put a rough sketch of what I think the chunked approach looks like at the end of this post). I've also considered using data.table or dplyr to optimize the data manipulation before modeling, but I'm not clear on best practices for analysis at this scale.

Can anyone provide guidance on efficiently fitting a linear model on large datasets in R while managing memory usage? For context, this is for a CLI tool, and I'm running R on Ubuntu. Any suggestions would be helpful. Thanks in advance!
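
For reference, here is the chunked-fitting pattern I've pieced together from the biglm documentation. This is only a sketch under my assumptions: the column names (`y`, `x1`, `x2`, `x3`) and the chunk size are placeholders for my actual data, and I'm not certain I'm using the `update()` step correctly.

```r
# Rough sketch of the chunked biglm approach I'm considering.
# Column names and chunk_size are placeholders for my real data.
library(biglm)

chunk_size <- 1e6
row_ids    <- seq_len(nrow(large_df))
chunks     <- split(row_ids, ceiling(row_ids / chunk_size))

# Fit on the first chunk, then fold in the remaining chunks one at a time
fit <- biglm(y ~ x1 + x2 + x3, data = large_df[chunks[[1]], ])
for (i in seq_along(chunks)[-1]) {
  fit <- update(fit, large_df[chunks[[i]], ])
}

summary(fit)
```

Is that the right way to use `update()` here, or does the data need to be read in chunks from disk for this to actually help with memory?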