Fix: scikit-learn Not Working — NotFittedError, NaN Input, Pipeline, and ConvergenceWarning
Quick Answer
How to fix scikit-learn errors — NotFittedError (call fit before predict), ValueError: Input contains NaN, could not convert string to float, Pipeline and ColumnTransformer mistakes, cross-validation data leakage, n_jobs=-1 hanging on Windows, and ConvergenceWarning.
The Error
You call predict() and scikit-learn refuses:
NotFittedError: This StandardScaler instance is not fitted yet.
Call 'fit' with appropriate arguments before using this estimator.
Or your model won’t train because of missing data:
ValueError: Input X contains NaN, infinity or a value too large for dtype('float64').
Or the data preparation step explodes because you forgot to encode categoricals:
ValueError: could not convert string to float: 'male'
Or your Pipeline raises a shape mismatch you can’t trace back to any single step.
Or GridSearchCV with n_jobs=-1 silently hangs on Windows and never returns.
scikit-learn’s API is consistent by design, but that consistency hides several non-obvious failure modes — particularly around Pipelines, data leakage, and categorical encoding. This guide covers all of them.
Why This Happens
scikit-learn follows a strict fit/transform/predict contract: every estimator must be fitted on training data before it can transform or predict. The library trusts that you’ve cleaned your data — it doesn’t impute missing values or encode strings automatically. The Pipeline API is intentionally strict about step ordering and interface compatibility.
Most errors come from violating one of three rules: fitting before calling transform/predict, passing clean training data but dirty test data, or applying transformations outside a Pipeline in a way that leaks test information into training.
Fix 1: NotFittedError — Call fit() First
NotFittedError: This StandardScaler instance is not fitted yet.
Every scikit-learn estimator — transformers, classifiers, clusterers — must be fitted before it can do anything else:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[2.0, 3.0], [4.0, 5.0]])
y_train = np.array([0, 1, 0])
# WRONG — transform before fit
scaler = StandardScaler()
X_scaled = scaler.transform(X_train) # NotFittedError
# CORRECT — fit first, then transform
scaler = StandardScaler()
scaler.fit(X_train) # Learns mean and std from training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # Applies training stats to test data
# OR — fit_transform combines both steps (training data only)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Never fit_transform on test data
A critical rule: fit_transform() is a shortcut for fit() + transform(). Use it only on training data. On test data, always use transform() alone — the test set must be scaled using the statistics learned from the training set, not from itself.
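A quick way to see why this matters, with values chosen so the train and test sets have different spreads: refitting on the test set produces different numbers than applying the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [3.0], [5.0]])  # mean 3, std ~1.63
X_test = np.array([[2.0], [4.0]])          # mean 3, std 1

scaler = StandardScaler().fit(X_train)

# Correct: test data scaled with the TRAINING mean/std
correct = scaler.transform(X_test)

# Wrong: fit_transform on the test set uses the test set's own statistics
wrong = StandardScaler().fit_transform(X_test)

print(correct.ravel())  # roughly [-0.61, 0.61], using train stats
print(wrong.ravel())    # [-1., 1.], using the test set's own stats
```

Same raw values, different scaled values: that difference is exactly the information the model was never supposed to see.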
Check if an estimator is fitted without triggering an error:
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError
scaler = StandardScaler()
try:
check_is_fitted(scaler)
print("Fitted")
except NotFittedError:
print("Not fitted")
Fix 2: ValueError: Input X Contains NaN — Handle Missing Values First
ValueError: Input X contains NaN, infinity or a value too large for dtype('float64').
scikit-learn estimators don’t handle missing values by default — they raise this error on any NaN in the input. You must impute or drop before fitting.
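Dropping rows is the other option mentioned above, and it is the simplest fix when only a handful of rows are affected. A minimal NumPy sketch (remember to drop the matching target rows too):

```python
import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, np.nan]])
y = np.array([0, 1, 0])

# Keep only rows with no NaN, and keep y aligned with X
mask = ~np.isnan(X).any(axis=1)
X_drop, y_drop = X[mask], y[mask]
print(X_drop)  # [[1. 2.]]
print(y_drop)  # [0]
```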
SimpleImputer is the standard fix:
from sklearn.impute import SimpleImputer
import numpy as np
X = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, np.nan]])
# Strategy options: 'mean', 'median', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='mean')
X_clean = imputer.fit_transform(X)
print(X_clean)
# [[1. 2. ]
# [3. 2.5] ← column means used for NaN
# [5. 2.5]]
KNNImputer uses neighboring rows to fill — better for correlated features:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)
X_clean = imputer.fit_transform(X)
Pro Tip: Use KNNImputer when features are correlated — it captures relationships between columns that SimpleImputer’s per-column means miss entirely. On large datasets it’s slower, so SimpleImputer(strategy='median') is the practical default when you have millions of rows.
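Whichever imputer you use, the fact that a value was missing can itself carry signal. SimpleImputer's add_indicator parameter keeps it, appending a binary "was missing" column for each affected feature; a small sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, np.nan]])

# add_indicator=True appends one binary indicator column for each
# feature that contained NaN during fit
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_clean = imputer.fit_transform(X)
print(X_clean)
# rows: [1, 2, 0, 0], [3, 3, 1, 0], [5, 2.5, 0, 1]
```

The last two columns mark where NaN originally appeared, so the model can learn from the missingness pattern itself.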
Check for NaN before fitting:
import numpy as np
import pandas as pd
# For NumPy arrays
print(np.isnan(X).sum()) # Total NaN count
print(np.isnan(X).any(axis=0)) # NaN per column
# For Pandas DataFrames
print(df.isna().sum()) # NaN per column
print(df[df.isna().any(axis=1)]) # Rows with any NaN
Infinity causes the same error. Replace it before fitting:
import numpy as np
X[np.isinf(X)] = np.nan # Convert inf to NaN first
imputer = SimpleImputer(strategy='mean')
X_clean = imputer.fit_transform(X)
Fix 3: could not convert string to float — Encode Categoricals
ValueError: could not convert string to float: 'male'
sklearn estimators require numerical input. String columns must be encoded first.
OneHotEncoder for nominal categories (no natural order):
from sklearn.preprocessing import OneHotEncoder
import numpy as np
X_cat = np.array([['male'], ['female'], ['male'], ['female']])
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False) # sparse_output requires sklearn 1.2+; use sparse=False on older versions
X_encoded = encoder.fit_transform(X_cat)
print(X_encoded)
# [[0. 1.]
# [1. 0.]
# [0. 1.]
# [1. 0.]]
print(encoder.categories_) # [array(['female', 'male'], dtype=object)]
OrdinalEncoder for ordinal categories (with natural order):
from sklearn.preprocessing import OrdinalEncoder
X_size = np.array([['small'], ['large'], ['medium'], ['large']])
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
X_encoded = encoder.fit_transform(X_size)
# [[0.], [2.], [1.], [2.]] ← respects the ordering you specified
LabelEncoder is for target y only — not for feature columns:
from sklearn.preprocessing import LabelEncoder
y = np.array(['cat', 'dog', 'cat', 'fish'])
# LabelEncoder — encodes 1D target arrays
le = LabelEncoder()
y_encoded = le.fit_transform(y) # [0, 1, 0, 2]
# WRONG — don't use LabelEncoder on X features
# It doesn't handle multiple columns, and the order is arbitrary.
# Use OrdinalEncoder for features instead.Common Mistake: Using LabelEncoder on feature columns produces arbitrary integer mappings that mislead models into assuming ordinal relationships between categories that don’t have them. Always use OneHotEncoder for nominal features, and OrdinalEncoder with an explicit categories list for ordered ones.
Fix 4: Pipeline and ColumnTransformer — The Correct Pattern
Pipelines prevent data leakage and keep preprocessing reproducible. The mistake most developers make is applying transformers outside the Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
import pandas as pd
# Sample data with mixed types
df = pd.DataFrame({
'age': [25, 32, 47, 51, 23],
'salary': [40000, 62000, 80000, 90000, 35000],
'gender': ['male', 'female', 'male', 'female', 'male'],
'city': ['NY', 'LA', 'NY', 'Chicago', 'LA'],
})
y = [0, 1, 1, 0, 1]
numeric_features = ['age', 'salary']
categorical_features = ['gender', 'city']
# Build the preprocessing step
preprocessor = ColumnTransformer(transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features),
])
# Full pipeline: preprocessing + model
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression(max_iter=1000)),
])
# Train
pipeline.fit(df, y)
# Predict on new data — preprocessing is applied automatically
new_data = pd.DataFrame({
'age': [30], 'salary': [55000], 'gender': ['female'], 'city': ['NY']
})
print(pipeline.predict(new_data)) # [1]
Access intermediate steps for debugging:
# Get a step by name
scaler = pipeline.named_steps['preprocessor']
# Get the fitted transformer within ColumnTransformer
fitted_scaler = pipeline.named_steps['preprocessor'].named_transformers_['num']
print(fitted_scaler.mean_) # [35.6, 61400.0]
print(fitted_scaler.scale_) # approx. [11.41, 21536.95]
make_pipeline shorthand — auto-names steps from class names:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Equivalent to Pipeline(steps=[('standardscaler', StandardScaler()), ('svc', SVC())])
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
pipe.fit(X_train, y_train)
remainder='passthrough' passes columns not listed in any transformer through unchanged:
preprocessor = ColumnTransformer(
transformers=[
('cat', OneHotEncoder(), ['gender']),
],
remainder='passthrough' # 'age', 'salary' pass through unchanged
)
Fix 5: Cross-Validation Data Leakage — Why Pipeline Matters
This is the most expensive scikit-learn mistake — it produces falsely optimistic validation scores that don’t reflect real performance:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np
X = np.random.rand(100, 10)
y = np.random.randint(0, 2, 100)
# WRONG — leaks test data into training via the scaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Scaler sees ALL data, including "test" folds!
scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)
# These scores are inflated — test data statistics leaked into training
The scaler was fitted on all 100 samples before cross-validation. When CV holds out 20 samples as “test,” those 20 samples already influenced the scaler’s mean and standard deviation. The test set is no longer unseen.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# CORRECT — Pipeline re-fits the scaler inside each fold
pipeline = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression()),
])
scores = cross_val_score(pipeline, X, y, cv=5)
# Now each fold: fit scaler on 80 samples, transform test 20 samples with those stats
# No leakage — honest evaluation
print(scores.mean(), scores.std())
cross_validate returns more detail than cross_val_score:
from sklearn.model_selection import cross_validate
results = cross_validate(
pipeline, X, y, cv=5,
scoring=['accuracy', 'roc_auc'],
return_train_score=True,
)
print(results['test_accuracy']) # Per-fold test accuracy
print(results['train_accuracy']) # Per-fold train accuracy (check for overfitting)
A large gap between train_accuracy and test_accuracy means the model is overfitting.
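The same leak-free pattern extends to hyperparameter search: pass the whole Pipeline to GridSearchCV and reference step parameters with the step__parameter naming convention. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameter names are <step name>__<parameter>; the scaler is
# re-fitted inside every fold, so the search stays leak-free
param_grid = {'clf__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Because the search object wraps the Pipeline, scaling is learned fresh on each training fold for every candidate C value.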
Fix 6: n_jobs=-1 Hangs on Windows — Add the Multiprocessing Guard
GridSearchCV, RandomizedSearchCV, cross_val_score, and several estimators accept n_jobs=-1 to use all CPU cores. On Windows, this spawns new Python processes — and if the script doesn’t have the if __name__ == '__main__': guard, each spawned process re-runs the entire script, including the GridSearchCV call, causing an infinite fork loop.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
# WRONG on Windows — hangs indefinitely or throws RuntimeError
search = GridSearchCV(SVC(), param_grid, n_jobs=-1, cv=5)
search.fit(X, y)
# CORRECT — wrap in the multiprocessing guard
if __name__ == '__main__':
search = GridSearchCV(SVC(), param_grid, n_jobs=-1, cv=5)
search.fit(X, y)
print(search.best_params_)
This guard tells Python: “only run this code in the main process, not in worker processes.” Without it, each spawned process immediately spawns more processes.
In Jupyter Notebooks on Windows, use n_jobs=1 or n_jobs=2 instead — the __main__ guard doesn’t help in notebook cells. On Linux, n_jobs=-1 works without the guard because the default fork start method doesn’t re-import the script; note that macOS has defaulted to spawn since Python 3.8, so keep the guard there as well.
For the full explanation of Python’s multiprocessing behavior on Windows, see Python multiprocessing not working.
Fix 7: ConvergenceWarning — Model Didn’t Converge
ConvergenceWarning: Saga did not converge. See the solver attribute of the estimator for more information.
ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
The optimizer ran out of iterations before finding the minimum. This doesn’t crash — it silently returns a partially trained model that may perform worse than expected.
Fix 1: Increase max_iter:
from sklearn.linear_model import LogisticRegression
# Default max_iter is 100 — often not enough
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Fix 2: Scale your features — unnormalized features with very different ranges (e.g., age 20–80 vs salary 30000–200000) cause gradient-based solvers to take much longer to converge:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Scaling almost always fixes ConvergenceWarning
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=300))
model.fit(X_train, y_train)
Fix 3: Try a different solver — some solvers converge faster depending on the dataset size:
# For large datasets: 'saga' or 'sag' (stochastic, scales better)
# For small-medium datasets: 'lbfgs' (default, good for multi-class)
# For binary classification with L1 regularization: 'liblinear'
model = LogisticRegression(solver='saga', max_iter=500)
Turn the warning into an error to catch it during development:
import warnings
from sklearn.exceptions import ConvergenceWarning
with warnings.catch_warnings():
warnings.simplefilter('error', ConvergenceWarning)
model.fit(X_train, y_train) # Raises instead of warning
Fix 8: Feature Name Mismatch — scikit-learn 1.0+ Validation
scikit-learn 1.0 introduced feature name tracking. If you fit on a Pandas DataFrame, sklearn records the column names. Passing a NumPy array at predict time — or a DataFrame with different column names — triggers a warning or error:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df_train = pd.DataFrame({'age': [25, 32], 'salary': [40000, 62000]})
scaler = StandardScaler()
scaler.fit(df_train)
# WARNING — array has no feature names, but scaler was fit on DataFrame
import numpy as np
X_array = np.array([[30, 55000]])
scaler.transform(X_array)
# UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
# CORRECT option 1 — always use DataFrame at transform time
df_test = pd.DataFrame({'age': [30], 'salary': [55000]})
scaler.transform(df_test) # No warning
# CORRECT option 2 — fit on array from the start
scaler2 = StandardScaler()
scaler2.fit(df_train.values) # Fit on NumPy array — no name tracking
scaler2.transform(X_array) # No warning
Column order matters — passing columns in a different order silently produces wrong results:
# WRONG — columns swapped; scaler applies wrong stats to each feature
df_wrong = pd.DataFrame({'salary': [55000], 'age': [30]}) # reversed!
scaler.transform(df_wrong) # No error, but results are wrong
# CORRECT — always maintain the same column order used during fit
df_test = df_test[df_train.columns] # Reorder to match training columns
sklearn 1.2+ DataFrame output with set_output:
from sklearn import set_config
from sklearn.preprocessing import StandardScaler
import pandas as pd
# All transformers return DataFrames instead of NumPy arrays
set_config(transform_output='pandas')
scaler = StandardScaler()
result = scaler.fit_transform(df_train)
print(type(result)) # <class 'pandas.core.frame.DataFrame'>
print(result.columns) # Index(['age', 'salary'], dtype='object')This preserves column names through pipelines, making debugging far easier.
Still Not Working?
DataConversionWarning — Target Shape
DataConversionWarning: A column-vector y was passed when a 1d array was expected.
Please change the shape of y to (n_samples,), using ravel().
Your target y has shape (n, 1) instead of (n,). Fix it with .ravel():
import numpy as np
import pandas as pd
# If y came from a DataFrame column selection
y_df = df[['target']] # shape (n, 1) — 2D
y_df.values.ravel() # shape (n,) — 1D ✓
# If y is a NumPy 2D column vector
y_arr = np.array([[0], [1], [0], [1]]) # shape (4, 1)
y_arr.ravel() # shape (4,) ✓
GridSearchCV — No Improvement Despite Tuning
If your best CV score looks the same across all parameter values, check:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Symptom: all param combinations give the same score
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': ['scale', 'auto']}
search = GridSearchCV(SVC(), param_grid, cv=5, verbose=2)
search.fit(X_train, y_train)
# Check result distribution
import pandas as pd
results = pd.DataFrame(search.cv_results_)
print(results[['param_C', 'param_gamma', 'mean_test_score']].sort_values('mean_test_score', ascending=False))
A flat score usually means class imbalance or features that aren’t informative. Check class distribution with np.bincount(y_train) and consider class_weight='balanced'.
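Both checks in one place, on a synthetic imbalanced dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Roughly 90/10 imbalanced binary dataset
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
print(np.bincount(y))  # heavily skewed class counts

# class_weight='balanced' reweights samples inversely to class frequency
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X, y)
```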
Saving and Loading Pipelines
import joblib
# Save the fitted pipeline (preserves all fitted parameters)
joblib.dump(pipeline, 'pipeline.joblib')
# Load and predict — no need to re-fit
pipeline_loaded = joblib.load('pipeline.joblib')
predictions = pipeline_loaded.predict(X_test)
Warning: Always re-train when upgrading scikit-learn versions. The internal representation of fitted models can change between minor versions, and a model trained on sklearn 1.2 may not load correctly on 1.4.
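One defensive pattern: store the sklearn version next to the model and compare it at load time. A sketch (the filename and dict keys here are illustrative, not a sklearn convention):

```python
import joblib
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(np.array([[0.0], [1.0]]), [0, 1])

# Bundle the training-time version with the fitted model
joblib.dump({'model': model, 'sklearn_version': sklearn.__version__},
            'model.joblib')

bundle = joblib.load('model.joblib')
if bundle['sklearn_version'] != sklearn.__version__:
    print('Trained on sklearn', bundle['sklearn_version'],
          '- re-train before trusting predictions')
```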
NumPy and Data Preparation
scikit-learn expects 2D arrays (n_samples, n_features). If your data arrives as 1D, reshape it before fitting. For the full guide on NumPy array shapes, dtype handling, and broadcasting that underlies all sklearn input preparation, see NumPy not working.
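The reshape itself: reshape(-1, 1) turns a 1D feature vector into the (n_samples, 1) column layout sklearn expects, where -1 means "infer this dimension".

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0])  # shape (3,) -- rejected as X
y = np.array([2.0, 4.0, 6.0])

X = x.reshape(-1, 1)           # shape (3, 1) -- one feature column
model = LinearRegression().fit(X, y)
print(model.predict(np.array([[4.0]])))  # [8.]
```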
For loading and cleaning tabular data with Pandas before passing it to sklearn, the pandas merge key error article covers DataFrame joining patterns that commonly break at the train/test split boundary.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.