Fix: scikit-learn Not Working — NotFittedError, NaN Input, Pipeline, and ConvergenceWarning
Quick Answer
How to fix scikit-learn errors — NotFittedError (call fit before predict), ValueError: Input contains NaN, could not convert string to float, Pipeline and ColumnTransformer mistakes, cross-validation data leakage, n_jobs=-1 hanging on Windows, and ConvergenceWarning.
The Error
You call predict() and scikit-learn refuses:
NotFittedError: This StandardScaler instance is not fitted yet.
Call 'fit' with appropriate arguments before using this estimator.
Or your model won’t train because of missing data:
ValueError: Input X contains NaN, infinity or a value too large for dtype('float64').
Or the data preparation step explodes because you forgot to encode categoricals:
ValueError: could not convert string to float: 'male'
Or your Pipeline raises a shape mismatch you can’t trace back to any single step.
Or GridSearchCV with n_jobs=-1 silently hangs on Windows and never returns.
scikit-learn’s API is consistent by design, but that consistency hides several non-obvious failure modes — particularly around Pipelines, data leakage, and categorical encoding. This guide covers all of them.
Why This Happens
scikit-learn follows a strict fit/transform/predict contract: every estimator must be fitted on training data before it can transform or predict. The library trusts that you’ve cleaned your data — it doesn’t impute missing values or encode strings automatically. The Pipeline API is intentionally strict about step ordering and interface compatibility.
Most errors come from violating one of three rules: fitting before calling transform/predict, passing clean training data but dirty test data, or applying transformations outside a Pipeline in a way that leaks test information into training.
Fix 1: NotFittedError — Call fit() First
NotFittedError: This StandardScaler instance is not fitted yet.
Every scikit-learn estimator — transformers, classifiers, clusterers — must be fitted before it can do anything else:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[2.0, 3.0], [4.0, 5.0]])
y_train = np.array([0, 1, 0])
# WRONG — transform before fit
scaler = StandardScaler()
X_scaled = scaler.transform(X_train) # NotFittedError
# CORRECT — fit first, then transform
scaler = StandardScaler()
scaler.fit(X_train) # Learns mean and std from training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # Applies training stats to test data
# OR — fit_transform combines both steps (training data only)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Never fit_transform on test data
A critical rule: fit_transform() is a shortcut for fit() + transform(). Use it only on training data. On test data, always use transform() alone — the test set must be scaled using the statistics learned from the training set, not from itself.
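A quick way to see why this matters, with values chosen so the train and test sets have different spreads: refitting on the test set produces different numbers than applying the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [3.0], [5.0]])  # mean 3, std ~1.63
X_test = np.array([[2.0], [4.0]])          # mean 3, std 1

scaler = StandardScaler().fit(X_train)

# Correct: test data scaled with the TRAINING mean/std
correct = scaler.transform(X_test)

# Wrong: fit_transform on the test set uses the test set's own statistics
wrong = StandardScaler().fit_transform(X_test)

print(correct.ravel())  # roughly [-0.61, 0.61], using train stats
print(wrong.ravel())    # [-1., 1.], using the test set's own stats
```

Same raw values, different scaled values: that difference is exactly the information the model was never supposed to see.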
Check if an estimator is fitted without triggering an error:
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError
scaler = StandardScaler()
try:
check_is_fitted(scaler)
print("Fitted")
except NotFittedError:
print("Not fitted")
Fix 2: ValueError: Input X Contains NaN — Handle Missing Values First
ValueError: Input X contains NaN, infinity or a value too large for dtype('float64').
scikit-learn estimators don’t handle missing values by default — they raise this error on any NaN in the input. You must impute or drop before fitting.
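Dropping rows is the other option mentioned above, and it is the simplest fix when only a handful of rows are affected. A minimal NumPy sketch (remember to drop the matching target rows too):

```python
import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, np.nan]])
y = np.array([0, 1, 0])

# Keep only rows with no NaN, and keep y aligned with X
mask = ~np.isnan(X).any(axis=1)
X_drop, y_drop = X[mask], y[mask]
print(X_drop)  # [[1. 2.]]
print(y_drop)  # [0]
```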
SimpleImputer is the standard fix:
from sklearn.impute import SimpleImputer
import numpy as np
X = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, np.nan]])
# Strategy options: 'mean', 'median', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='mean')
X_clean = imputer.fit_transform(X)
print(X_clean)
# [[1. 2. ]
# [3. 2.5] ← column means used for NaN
# [5. 2.5]]
KNNImputer uses neighboring rows to fill — better for correlated features:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)
X_clean = imputer.fit_transform(X)
Pro Tip: Use KNNImputer when features are correlated — it captures relationships between columns that SimpleImputer’s per-column means miss entirely. On large datasets it’s slower, so SimpleImputer(strategy='median') is the practical default when you have millions of rows.
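Whichever imputer you use, the fact that a value was missing can itself carry signal. SimpleImputer's add_indicator parameter keeps it, appending a binary "was missing" column for each affected feature; a small sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, np.nan]])

# add_indicator=True appends one binary indicator column for each
# feature that contained NaN during fit
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_clean = imputer.fit_transform(X)
print(X_clean)
# rows: [1, 2, 0, 0], [3, 3, 1, 0], [5, 2.5, 0, 1]
```

The last two columns mark where NaN originally appeared, so the model can learn from the missingness pattern itself.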
Check for NaN before fitting:
import numpy as np
import pandas as pd
# For NumPy arrays
print(np.isnan(X).sum()) # Total NaN count
print(np.isnan(X).any(axis=0)) # NaN per column
# For Pandas DataFrames
print(df.isna().sum()) # NaN per column
print(df[df.isna().any(axis=1)]) # Rows with any NaN
Infinity causes the same error. Replace it before fitting:
import numpy as np
X[np.isinf(X)] = np.nan # Convert inf to NaN first
imputer = SimpleImputer(strategy='mean')
X_clean = imputer.fit_transform(X)
Fix 3: could not convert string to float — Encode Categoricals
ValueError: could not convert string to float: 'male'
sklearn estimators require numerical input. String columns must be encoded first.
OneHotEncoder for nominal categories (no natural order):
from sklearn.preprocessing import OneHotEncoder
import numpy as np
X_cat = np.array([['male'], ['female'], ['male'], ['female']])
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False) # sparse_output requires sklearn 1.2+; use sparse=False on older versions
X_encoded = encoder.fit_transform(X_cat)
print(X_encoded)
# [[0. 1.]
# [1. 0.]
# [0. 1.]
# [1. 0.]]
print(encoder.categories_) # [array(['female', 'male'], dtype=object)]
OrdinalEncoder for ordinal categories (with natural order):
from sklearn.preprocessing import OrdinalEncoder
X_size = np.array([['small'], ['large'], ['medium'], ['large']])
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
X_encoded = encoder.fit_transform(X_size)
# [[0.], [2.], [1.], [2.]] ← respects the ordering you specified
LabelEncoder is for target y only — not for feature columns:
from sklearn.preprocessing import LabelEncoder
y = np.array(['cat', 'dog', 'cat', 'fish'])
# LabelEncoder — encodes 1D target arrays
le = LabelEncoder()
y_encoded = le.fit_transform(y) # [0, 1, 0, 2]
# WRONG — don't use LabelEncoder on X features
# It doesn't handle multiple columns, and the order is arbitrary.
# Use OrdinalEncoder for features instead.Common Mistake: Using LabelEncoder on feature columns produces arbitrary integer mappings that mislead models into assuming ordinal relationships between categories that don’t have them. Always use OneHotEncoder for nominal features, and OrdinalEncoder with an explicit categories list for ordered ones.
Fix 4: Pipeline and ColumnTransformer — The Correct Pattern
Pipelines prevent data leakage and keep preprocessing reproducible. The mistake most developers make is applying transformers outside the Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
import pandas as pd
# Sample data with mixed types
df = pd.DataFrame({
'age': [25, 32, 47, 51, 23],
'salary': [40000, 62000, 80000, 90000, 35000],
'gender': ['male', 'female', 'male', 'female', 'male'],
'city': ['NY', 'LA', 'NY', 'Chicago', 'LA'],
})
y = [0, 1, 1, 0, 1]
numeric_features = ['age', 'salary']
categorical_features = ['gender', 'city']
# Build the preprocessing step
preprocessor = ColumnTransformer(transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features),
])
# Full pipeline: preprocessing + model
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression(max_iter=1000)),
])
# Train
pipeline.fit(df, y)
# Predict on new data — preprocessing is applied automatically
new_data = pd.DataFrame({
'age': [30], 'salary': [55000], 'gender': ['female'], 'city': ['NY']
})
print(pipeline.predict(new_data)) # [1]
Access intermediate steps for debugging:
# Get a step by name
scaler = pipeline.named_steps['preprocessor']
# Get the fitted transformer within ColumnTransformer
fitted_scaler = pipeline.named_steps['preprocessor'].named_transformers_['num']
print(fitted_scaler.mean_) # [35.6, 61400.0]
print(fitted_scaler.scale_) # approx. [11.41, 21536.95]
make_pipeline shorthand — auto-names steps from class names:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Equivalent to Pipeline(steps=[('standardscaler', StandardScaler()), ('svc', SVC())])
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
pipe.fit(X_train, y_train)
remainder='passthrough' passes columns not listed in any transformer through unchanged:
preprocessor = ColumnTransformer(
transformers=[
('cat', OneHotEncoder(), ['gender']),
],
remainder='passthrough' # 'age', 'salary' pass through unchanged
)
Fix 5: Cross-Validation Data Leakage — Why Pipeline Matters
This is the most expensive scikit-learn mistake — it produces falsely optimistic validation scores that don’t reflect real performance:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np
X = np.random.rand(100, 10)
y = np.random.randint(0, 2, 100)
# WRONG — leaks test data into training via the scaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Scaler sees ALL data, including "test" folds!
scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)
# These scores are inflated — test data statistics leaked into training
The scaler was fitted on all 100 samples before cross-validation. When CV holds out 20 samples as “test,” those 20 samples already influenced the scaler’s mean and standard deviation. The test set is no longer unseen.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# CORRECT — Pipeline re-fits the scaler inside each fold
pipeline = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression()),
])
scores = cross_val_score(pipeline, X, y, cv=5)
# Now each fold: fit scaler on 80 samples, transform test 20 samples with those stats
# No leakage — honest evaluation
print(scores.mean(), scores.std())
cross_validate returns more detail than cross_val_score:
from sklearn.model_selection import cross_validate
results = cross_validate(
pipeline, X, y, cv=5,
scoring=['accuracy', 'roc_auc'],
return_train_score=True,
)
print(results['test_accuracy']) # Per-fold test accuracy
print(results['train_accuracy']) # Per-fold train accuracy (check for overfitting)
A large gap between train_accuracy and test_accuracy means the model is overfitting.
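The same leak-free pattern extends to hyperparameter search: pass the whole Pipeline to GridSearchCV and reference step parameters with the step__parameter naming convention. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameter names are <step name>__<parameter>; the scaler is
# re-fitted inside every fold, so the search stays leak-free
param_grid = {'clf__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Because the search object wraps the Pipeline, scaling is learned fresh on each training fold for every candidate C value.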
Fix 6: n_jobs=-1 Hangs on Windows — Add the Multiprocessing Guard
GridSearchCV, RandomizedSearchCV, cross_val_score, and several estimators accept n_jobs=-1 to use all CPU cores. On Windows, this spawns new Python processes — and if the script doesn’t have the if __name__ == '__main__': guard, each spawned process re-runs the entire script, including the GridSearchCV call, causing an infinite fork loop.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
# WRONG on Windows — hangs indefinitely or throws RuntimeError
search = GridSearchCV(SVC(), param_grid, n_jobs=-1, cv=5)
search.fit(X, y)
# CORRECT — wrap in the multiprocessing guard
if __name__ == '__main__':
search = GridSearchCV(SVC(), param_grid, n_jobs=-1, cv=5)
search.fit(X, y)
print(search.best_params_)
This guard tells Python: “only run this code in the main process, not in worker processes.” Without it, each spawned process immediately spawns more processes.
In Jupyter Notebooks on Windows, use n_jobs=1 or n_jobs=2 instead — the __main__ guard doesn’t help in notebook cells. On Linux, n_jobs=-1 works without the guard because the default fork start method doesn’t re-import the script; note that macOS has defaulted to spawn since Python 3.8, so keep the guard there as well.
For the full explanation of Python’s multiprocessing behavior on Windows, see Python multiprocessing not working.
Fix 7: ConvergenceWarning — Model Didn’t Converge
ConvergenceWarning: Saga did not converge. See the solver attribute of the estimator for more information.
ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
The optimizer ran out of iterations before finding the minimum. This doesn’t crash — it silently returns a partially trained model that may perform worse than expected.
Fix 1: Increase max_iter:
from sklearn.linear_model import LogisticRegression
# Default max_iter is 100 — often not enough
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Fix 2: Scale your features — unnormalized features with very different ranges (e.g., age 20–80 vs salary 30000–200000) cause gradient-based solvers to take much longer to converge:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Scaling almost always fixes ConvergenceWarning
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=300))
model.fit(X_train, y_train)
Fix 3: Try a different solver — some solvers converge faster depending on the dataset size:
# For large datasets: 'saga' or 'sag' (stochastic, scales better)
# For small-medium datasets: 'lbfgs' (default, good for multi-class)
# For binary classification with L1 regularization: 'liblinear'
model = LogisticRegression(solver='saga', max_iter=500)
Turn the warning into an error to catch it during development:
import warnings
from sklearn.exceptions import ConvergenceWarning
with warnings.catch_warnings():
warnings.simplefilter('error', ConvergenceWarning)
model.fit(X_train, y_train) # Raises instead of warning
Fix 8: Feature Name Mismatch — scikit-learn 1.0+ Validation
scikit-learn 1.0 introduced feature name tracking. If you fit on a Pandas DataFrame, sklearn records the column names. Passing a NumPy array at predict time — or a DataFrame with different column names — triggers a warning or error:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df_train = pd.DataFrame({'age': [25, 32], 'salary': [40000, 62000]})
scaler = StandardScaler()
scaler.fit(df_train)
# WARNING — array has no feature names, but scaler was fit on DataFrame
import numpy as np
X_array = np.array([[30, 55000]])
scaler.transform(X_array)
# UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
# CORRECT option 1 — always use DataFrame at transform time
df_test = pd.DataFrame({'age': [30], 'salary': [55000]})
scaler.transform(df_test) # No warning
# CORRECT option 2 — fit on array from the start
scaler2 = StandardScaler()
scaler2.fit(df_train.values) # Fit on NumPy array — no name tracking
scaler2.transform(X_array) # No warning
Column order matters — passing columns in a different order silently produces wrong results:
# WRONG — columns swapped; scaler applies wrong stats to each feature
df_wrong = pd.DataFrame({'salary': [55000], 'age': [30]}) # reversed!
scaler.transform(df_wrong) # No error, but results are wrong
# CORRECT — always maintain the same column order used during fit
df_test = df_test[df_train.columns] # Reorder to match training columns
sklearn 1.2+ DataFrame output with set_output:
from sklearn import set_config
from sklearn.preprocessing import StandardScaler
import pandas as pd
# All transformers return DataFrames instead of NumPy arrays
set_config(transform_output='pandas')
scaler = StandardScaler()
result = scaler.fit_transform(df_train)
print(type(result)) # <class 'pandas.core.frame.DataFrame'>
print(result.columns) # Index(['age', 'salary'], dtype='object')This preserves column names through pipelines, making debugging far easier.
Still Not Working?
DataConversionWarning — Target Shape
DataConversionWarning: A column-vector y was passed when a 1d array was expected.
Please change the shape of y to (n_samples,), using ravel().
Your target y has shape (n, 1) instead of (n,). Fix it with .ravel():
import numpy as np
import pandas as pd
# If y came from a DataFrame column selection
y_df = df[['target']] # shape (n, 1) — 2D
y_df.values.ravel() # shape (n,) — 1D ✓
# If y is a NumPy 2D column vector
y_arr = np.array([[0], [1], [0], [1]]) # shape (4, 1)
y_arr.ravel() # shape (4,) ✓
GridSearchCV — No Improvement Despite Tuning
If your best CV score looks the same across all parameter values, check:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Symptom: all param combinations give the same score
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': ['scale', 'auto']}
search = GridSearchCV(SVC(), param_grid, cv=5, verbose=2)
search.fit(X_train, y_train)
# Check result distribution
import pandas as pd
results = pd.DataFrame(search.cv_results_)
print(results[['param_C', 'param_gamma', 'mean_test_score']].sort_values('mean_test_score', ascending=False))
A flat score usually means class imbalance or features that aren’t informative. Check class distribution with np.bincount(y_train) and consider class_weight='balanced'.
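Both checks in one place, on a synthetic imbalanced dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Roughly 90/10 imbalanced binary dataset
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
print(np.bincount(y))  # heavily skewed class counts

# class_weight='balanced' reweights samples inversely to class frequency
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X, y)
```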
Saving and Loading Pipelines
import joblib
# Save the fitted pipeline (preserves all fitted parameters)
joblib.dump(pipeline, 'pipeline.joblib')
# Load and predict — no need to re-fit
pipeline_loaded = joblib.load('pipeline.joblib')
predictions = pipeline_loaded.predict(X_test)
Warning: Always re-train when upgrading scikit-learn versions. The internal representation of fitted models can change between minor versions, and a model trained on sklearn 1.2 may not load correctly on 1.4.
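One defensive pattern: store the sklearn version next to the model and compare it at load time. A sketch (the filename and dict keys here are illustrative, not a sklearn convention):

```python
import joblib
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(np.array([[0.0], [1.0]]), [0, 1])

# Bundle the training-time version with the fitted model
joblib.dump({'model': model, 'sklearn_version': sklearn.__version__},
            'model.joblib')

bundle = joblib.load('model.joblib')
if bundle['sklearn_version'] != sklearn.__version__:
    print('Trained on sklearn', bundle['sklearn_version'],
          '- re-train before trusting predictions')
```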
NumPy and Data Preparation
scikit-learn expects 2D arrays (n_samples, n_features). If your data arrives as 1D, reshape it before fitting. For the full guide on NumPy array shapes, dtype handling, and broadcasting that underlies all sklearn input preparation, see NumPy not working.
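The reshape itself: reshape(-1, 1) turns a 1D feature vector into the (n_samples, 1) column layout sklearn expects, where -1 means "infer this dimension".

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0])  # shape (3,) -- rejected as X
y = np.array([2.0, 4.0, 6.0])

X = x.reshape(-1, 1)           # shape (3, 1) -- one feature column
model = LinearRegression().fit(X, y)
print(model.predict(np.array([[4.0]])))  # [8.]
```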
For loading and cleaning tabular data with Pandas before passing it to sklearn, the pandas merge key error article covers DataFrame joining patterns that commonly break at the train/test split boundary.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.