Home

A multiple linear regression approach to predicting the number of minutes needed to make your dinner.

Introduction

When you’re hungry and scrolling through recipes, one of the first questions on your mind is: how long is this going to take me? Some recipe sites give prep time estimates, but they’re often vague, inconsistent, or overly optimistic. Wouldn’t it be nice to have a more systematic way to predict cooking time from the actual recipe data?

This project uses a dataset from Food.com, which includes two main CSVs: one with detailed recipe information (ingredients, steps, nutrition, tags) and another with user-submitted ratings. After merging and cleaning this data, we aim to build a model that predicts the total number of minutes a recipe will take to make.

The main guiding question:

Is there a definable relationship between a recipe’s characteristics (like number of steps, ingredients, and calories) and how long it takes to cook?

To explore this, we’ll start by loading and cleaning the data, then we’ll dive into some exploratory analysis, followed by building a predictive model with linear regression and then a more complex, feature-rich pipeline.

To start let’s report some information about our datasets. We have two: recipes.csv and ratings.csv. The recipes.csv has 83782 rows or recipes. The ratings.csv contaings every interaction that a user has with a given recipe. That means there are significantly more rows in ratings.csv than recipes.csv. The columns in each row and descriptions of them are given below:

Recipes Table

Column	Description
`name`	Recipe name
`id`	Recipe ID
`minutes`	Minutes to prepare recipe
`contributor_id`	User ID who submitted this recipe
`submitted`	Date recipe was submitted
`tags`	Food.com tags for recipe
`nutrition`	Nutrition information in the form `[calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]`; PDV stands for “percentage of daily value”
`n_steps`	Number of steps in recipe
`steps`	Text for recipe steps, in order
`description`	User-provided description

Ratings Table

Column	Description
`user_id`	User ID
`recipe_id`	Recipe ID
`date`	Date of interaction
`rating`	Rating given
`review`	Review text

Data Cleaning and Exploratory Data Analysis

Cleaning our data

We began by merging the two datasets (recipes.csv and ratings.csv) on their shared id column so that each recipe would include the user feedback. From the ratings, we computed the average rating per recipe and added this as a new column.

Interestingly, we noticed that some ratings were recorded as 0, which isn’t part of the expected 1–5 scale. We interpreted these as missing or invalid entries, and treated them accordingly. For recipes with no valid ratings, we imputed the missing average rating using the overall mean rating from the dataset. This ensured that every recipe had a value for avg_rating and prevented missing data from derailing our model later.

The distribtuion of the avg_rating column pre and post imputation is shown below.

Figure 1: Rating before and after mean imputation

We can see that mean imputation does not destroy the overall trend of the data, so it seems like a fine technique to use.

The nutrition column of our resulting dataset was stored as string representations of Python lists — for example, the ingredients, steps, tags, and nutrition fields. This is a bit unqeildy, so we used the ast.literal_eval() function to convert them into actual lists. We then unpacked those lists into columns of their own giving us the following columns:

calories
total fat
sugar
sodium
protein
saturated fat
carbohydrates

Filtering for outliers

During our EDA it was found that many of the numerical columns like minutes, calories, etc had some very obnoxious outliers that were not realistic values at all. Realistically, we plan for this model to be used by hungry and eachausted college students who come home and just want a quick estimate of how long dinner or any other meal is going to tak them to make. For this reason we decided to remove all recipes that aren’t inline with the recommended number of macronutrients for a given meal:

For the macronutrient columns (calories,protein,total fat,sugar,saturated fat), an arbitrary calorie cap of 1000 cal was chosen, and the bounds for the other macros were chosen based off the AMDR’s guidelines for an average healthy adult’s nutritional intake. Since the data was already in a PDV format, it was easy to filter based off the given AMDR guidelines. The cutoffs are given below:

Macronutrient	Recommended Range (% Daily Calories)
Carbohydrates	45–65%
Fat	20–35%
Protein	10–35%
Added Sugars	<10%
Saturated Fat	<10%

For the minutes column, a cutoff of 65 minutes was chosen in part because it is the 75th quartile of the minutes values, and because a person making dinner isn’t likely to spend more than an hour making dinner.

To make it easier on a potential user of this model, the macronutrient columns (calories,protein,total fat,sugar,saturated fat), were converted from PDV to grams, as its more likely a user would know the macaronutrients in their recipe as grams, not as a PDV.

Figure 2: Minutes Before Filtering

Figure 3: Minutes After Filtering

Determining recipe types

Many recipes have a list of tags (like “vegetarian” or “dessert”), but we created a simpler recipe type feature using text analysis. We applied TF-IDF on the following columns post cleaning to find characteristic keywords for each recipe; name,tags,steps,description,ingredients,review. Each recipe was then labeled with the top TF-IDF keyword as its “type.” We then only kept the top 40 most frequent terms from our pseudo recipe_types column, and classified the other recipe types as other. This was becuase many of the low frequency terms returned some nonsensical word that did not accurately represent the recipe at all. A list of some of those top words are presented below:

word	count
other	3088
potatoes	230
rice	224
beans	124
coffee	101

Table 1: Top 5 TF-IDF classifications by frequency

It’s important to note here that a lot of the values in recipe_type may not actually have names representative of the actual recipe, but may be associated with the recipe, ie. a major ingredient used in the recipe or something similar. Thus if a user doesn’t find the name of their recipe in the column, they may use a common ingredient found in their recipe that is found in the column. Otherwise they must select the other option when using our model.

Here’s are all the features included in X_recipes, the final cleaned dataframe that is used in the rest of our project.

Column Name	Description
name	Name of the recipe
minutes	Total preparation time (in minutes)
n_steps	Number of preparation steps
n_ingredients	Number of ingredients used
rating	User rating of the recipe
calories	Estimated calories per serving
total_fat_g	Total fat content (grams)
saturated_fat_g	Saturated fat content (grams)
sugar_g	Sugar content (grams)
protein_g	Protein content (grams)
sodium_mg	Sodium content (milligrams)
carbohydrates_g	Carbohydrates content (grams)
recipe_type	Category/type of recipe

Exploring the `minutes` column

Now, let’s dive into the data! First up: cooking time (minutes). How are recipe durations distributed? Is there a typical cook time most recipes fall under? In a histogram of minutes, we might see a peak around shorter times (e.g. many recipes take 20-40 minutes) and a long tail of recipes that take hours.

Figure 4: Distribution of Minutes

The distribution of recipe cooking times is right-skewed, with most recipes taking under 60 minutes. This tells us that quick meals dominate the dataset, with some rough peaks aroung the 30-40 min range.

We can also see if certain types of recipes tend to take longer. For example, a boxplot of minutes grouped by recipe_type could show that desserts versus main dishes have different prep time distributions. We can see that there’s definite variance between the different recipe type’s. A table grouped by recipe types showing the mean of every recipe type in ascending order shows this a bit better.

Figure 5: Minutes by Recipe Type

recipe_type	minutes	recipe_type	minutes	recipe_type	minutes
dough	38.9268	spinach	25.8191	sauce	21.4756
potatoes	38.1304	eggs	25.5667	other	20.4615
rice	32.7366	shrimp	25.5432	lemon	20.3387
soup	31.2133	pancakes	25.2162	mint	16.2982
chicken	30.8681	beans	25.1371	hummus	15.6949
pizza	30.7119	fish	24.8913	salsa	15.6364
beef	30.2439	tomatoes	24.5263	chocolate	15.5625
mushrooms	30.2	zucchini	24.4043	salad	14.9028
turkey	29.5405	broccoli	23.8046	dressing	11.4795
bacon	29.1964	tofu	23.75	lime	10
pasta	28.459	corn	23.0204	coffee	8.15842
pumpkin	28.2292	cheese	22.65	drink	3.9
flour	27.6122	asparagus	22.38	cocktail	3.62162
bread	25.9518	sesame	22.122	nan	nan

Table 2: Average cooking time by recipe type

What about relationships between minutes and other numeric features? Intuitively, recipes with more steps or ingredients might take more time. We explore scatter plots of minutes vs. n_steps (number of steps in the instructions) and vs. n_ingredients. As expected, there is a slight upward trend: recipes with more steps and ingredients do tend to require more minutes. It’s not a perfect correlation, but the positive association is there.

Figure 6: Scatter Plot of minutes vs n_step

Figure 7: Scatter Plot of minutes vs n_ingredients

Framing a Prediction Problem

After exploring, we decided our goal is to predict cooking time (minutes). Cooking time is more directly useful to someone planning a meal – knowing if a dish takes 15 minutes versus 2 hours is valuable! Ratings are interesting but subjective, and calorie counts depend a lot on portion sizes, which vary by recipe.

So, formal prediction question: Given a recipe’s attributes (ingredients, steps, nutritional info, etc.), can we accurately predict how many minutes it will take to make?

We’ll treat this as a regression problem, where the target variable is minutes.

We used standard multiple linear regression techniques built in scikit learn to achieve this task. A standard metric was chosen of MSE (Mean Squared Error) to evaluate the validity of our model.

To clarify, the features we plan to use include things like: the number of steps, number of ingredients, average user rating, nutritional stats (calories, fat, sugar, etc.), and the recipe type/category. We suspect these factors collectively influence cook time. For example, more steps or ingredients might mean more prep work, affecting the minutes. The recipe type might capture aspects like cooking techniques (a “bake” might take longer than a “salad”). Our hope is that by feeding all this information into a model, it can learn the typical patterns and give a reasonable time estimate for a new recipe.

Baseline Model

Our first attempt is a straightforward baseline model. We chose a simple multiple linear regression using a few key features that we believed would be most predictive:

n_steps, a discrete numerical value describing the number of steps needed to complete a recipe
calories, a continuous numerical value
recipe_type, a nominal categorical feature that we’ll one hot encode

Using these features, we fit a linear regression on 80% of the data (with 20% held out for testing). This is our baseline for comparison.

Baseline performance: The baseline model’s predictions turned out okay but not amazing. The Mean Squared Error (MSE) on the test set was about 197.5. In more intuitive terms, that means the root mean squared error is around $\sqrt{197} \approx 14 \ \text{minutes}$. So on average, our baseline predictions are about 13 minutes off from the actual time. That’s a sizable gap — if you’re expecting a 30-minute meal, it might actually take 43 minutes, which is the difference between a quick dinner and a long one! Clearly, there’s room for improvement.

We also looked at the learned coefficients to interpret the baseline model. The coefficients suggested that recipes with more steps do take longer (each additional step adds roughly 1.3 minutes on average, according to the model). Surprisingly, the calorie count had a smaller effect (the coefficient for calories was very low, implying that an extra 100 calories only adds about 1 minute of cook time). This makes sense: calories are more about ingredients than process. The recipe type dummy variables showed slight shifts; for example, the model might have given a positive bump to categories like “roast” (meaning if a recipe is a roast, it predicts a longer time, all else equal) and a negative bump to quick categories like “salad.” However, many of those category effects weren’t very large in the linear model.

feature	weight	feature	weight	feature	weight
bacon	4.74	eggs	2.65	rice	9.5
beans	4.29	fish	4.2	salad	-2.91
beef	8.84	flour	5.91	salsa	-2.9
bread	3.99	hummus	-4.96	sauce	1.95
broccoli	2.01	lemon	-0.56	sesame	0.91
cheese	1.22	lime	-7.55	shrimp	1.53
chicken	6.62	mint	-0.66	soup	9.38
chocolate	-3.71	mushrooms	7.95	spinach	2.71
cocktail	-12.2	other	0.11	tofu	-0.23
coffee	-8.58	pancakes	2.97	tomatoes	3.18
corn	1.35	pasta	3.77	turkey	6.37
dough	4.01	pizza	4.53	zucchini	3.03
dressing	-5.83	potatoes	14.37	calories	0.01
drink	-11.58	pumpkin	6.9	n_steps	1.27

Table 3: Features and their respective weights computed by our model

Figure 8: Overlaid predicition of our baseline model

In summary, the baseline linear model captures some obvious signals (steps matter!), but it’s not very accurate yet. We’ll use this as a reference point for building a better model.

Final Model

To improve on the baseline, we pulled out all the stops in feature engineering and modeling. Our final approach is still a regression model with all the features in our dataframe: , but with several enhancements:

Native Features	Description
`minutes`	Number of minutes taken to complete a recipe, numerical continous data
`rating`	Rating of a recipe from 1-5, numerical discrete
`calories`	Number of calories in a recipe, numerical continuous
`total_fat_g`	Total grams of fat in a recipe, numerical continuous
`saturated_fat_g`	Grams of saturated fat in a recipe, numerical continuous
`sugar_g`	Grams of (added) sugar in a recipe, numerical continuous
`sodium_mg`	Milligrams of sodium in a recipe, numerical continuous
`protein_g`	Grams of protein in a recipe, numerical continuous
`carbohydrates_g`	Grams of carbohydrates in a recipe, numerical continuous
`n_steps`	Number of steps to complete a recipe, numerical discrete
`n_ingredients`	Number of ingredients in a recipe, numerical discrete

Derived Features: We created new features such as normalized nutrition stats. For each recipe, we took nutritional values like fat, sugar, etc., converted to calories and divided them by total calories to get the proportion of . The idea is to capture the composition of the recipe (e.g., how sugary it is) rather than absolute calories, since absolute calories already correlate with other things. This gives features like “sugar per calorie” which might relate to recipe type (desserts vs. savory).
Binned Features: We used a KBinsDiscretizer to turn some numerical features into categorical buckets, and one hot encoded them. For example, we binned n_steps, n_ingredients, and calories into ranges (10 bins each). This can allow the model to learn non-linear relationships (maybe recipes with 1-5 steps aren’t that different, but beyond 10 steps, the time jumps a lot – a linear model might miss that if not binned).
Quantile Transformation: A QuantileTransformer was used to transform our data to fit more of a normal distribution. This type of transformation is known to work quite well with skewed data, which many of our features are. It’s also a monotonic transformation-it keeps the relative ordering of the data consistent.
Polynomial/Interaction Features: We considered that certain combinations of features or non-linear patterns could affect cook time. To allow for this, we added polynomial features (up to degree 3 for the numeric features). This means the model can account for, say, the squared effect of n_steps or an interaction between n_steps and n_ingredients if it helps.
Log Transforms: We introduced the possibility of log-transformed features. For instance, perhaps beyond a certain point, adding more steps has diminishing returns (a log relationship). We set up custom transformers that could apply a np.log1p(x) to features, with tunable parameters to adjust the curve. These were experimental features to see if any non-linear relationship would significantly improve the fit. The function itself looks something like this: $\text{Log Transform}(X) = a\log{(1+X)}$
Where $a$ is the decay/growth rate, and 1 is added to avoid division by 0.
Model Choice: We didn’t want to just assume linear regression is best. We also considered Ridge regression (a regularized linear model) to potentially handle multicollinearity among our many features and to prevent overfitting given the polynomial expansion. We treated the choice of using a plain Linear Regression vs. a Ridge regression (and the ridge penalty alpha value) as something to tune.

Whew! That’s a lot of moving parts. To manage this systematically, we set up a machine learning pipeline and used Grid Search with cross-validation to tune the hyperparameters. The hyperparameters we tuned included: the polynomial degree (how complex the interactions could get), the parameter a in our custom log transformer (to adjust their shape), the number of bins (n_bins) used by our KBinsDiscretizer, and the Ridge alpha (if Ridge was used). We used 5-fold cross-validation on the training set to try out different combinations and find what worked best.

Hyperparameter	Range Chosen	Optimal Parameter
`a` (log decay/growth)	[-4,-3,…,3]	-3
`n_bins`	[1,2,3,…,10]	9
`degree`	[1,2,3]	1
`alpha`	[0.01,0.1,1,5,10]	10

Table 4: Chosen Hyper parameters and their optimal values

After quite a bit of number crunching, we arrived at a final model. Interestingly, the best combination of transformations was to use a logarithmic transform (with a relatively high decay rate parameter) on certain features, along with polynomial features of degree 1. (It’s hard to interpret exactly what this means physically, but it suggests some diminishing returns in one area and maybe an overall linear growth in others). The best model ended up being a plain ridge regression with an alpha value of 10 (linear regression didn’t outperform it, though it was very close, implying our features were not causing huge overfitting issues).

Final model performance: Drumroll, please… The improved model brought the test MSE down to about 145. That’s roughly a 15% reduction in MSE compared to the baseline (197 → 166). In terms of root mean squared error, we went from ~14.8 minutes off to ~12 minutes off. So we gained about a one-minute improvement in average error by all that fancy feature engineering and tuning. Not a huge drop, but an improvement nonetheless!

	Train MSE	Test MSE
Baseline Model	192.035	197.483
Linear Regression w Poly feat.	164.349	164.542
Linear Regression w Log feat.	164.873	170.878
Ridge Regression w Poly feat.	166.864	166.635
Ridge Regression w Log feat.	166.864	166.633
Ridge Regression w both feat.	166.865	166.634

Table 5: All models trained and their respective training and testing MSE's

We should ask: is this level of error acceptable? 13-14 minutes uncertainty for a recipe’s cook time might be okay for some scenarios (predicting ~30 min vs actual 45 min is not too bad), but it’s still quite high if someone needs a very accurate estimate. It seems that predicting minutes is inherently tricky with the given data. Recipes can always have unobserved factors (technique difficulty, user skill, etc.) that affect prep time. Our model captures the obvious factors, but the variability in cooking is large.

In the end, we chose to stick with the ridge regression model for our final solution (instead of the Ridge), since it performed essentially the same. The added regularization didn’t yield a noticeable benefit, and the simpler model is easier to interpret and explain.

Reflection: So, how long is dinner gonna take? Our final model can give an estimate, but expect an error margin of about ±12 minutes. Not perfect, but better than a blind guess. Importantly, we confirmed some common-sense insights: the number of steps in a recipe is a strong indicator of prep time, and our model uses that heavily. We also learned that beyond a point, extra ingredients or calories don’t linearly increase cook time (hence the model leveraging log/exponential features). Perhaps truly nailing the prediction would require more detailed features (like parsing the text of the recipe for specific difficult techniques, or accounting for whether multiple steps can be done in parallel, etc.). Those are complexities beyond our current scope, but they hint at why this problem is challenging.

On the bright side, if you’re ever unsure how long a recipe will take, our project shows you can plug in the recipe’s details and get a ballpark time estimate. It might just save you from starting a 2-hour recipe when you only have 30 minutes before dinner! Bon appétit!