
Why Does Scikit-learn Demand Different Data Shapes For Different Regressors?

I always find myself reshaping my data when I'm working with sklearn, which is irritating and makes my code ugly. Why can't the library be made to work with a variety of data shapes?

Solution 1:

When you do y = np.random.rand(10), y is a one-dimensional array of shape (10,). It doesn't matter whether it's a row vector or a column vector; it's just a vector with a single dimension. Take a look at this answer, and this one too, to understand the philosophy behind it.
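A minimal sketch of that distinction in NumPy (the variable names are illustrative): a 1-D array has no row/column orientation until you reshape it into an explicit 2-D form.

```python
import numpy as np

y = np.random.rand(10)
print(y.shape)   # (10,) -- one dimension, neither a row nor a column
print(y.ndim)    # 1

col = y.reshape(-1, 1)   # explicit column vector, shape (10, 1)
row = y.reshape(1, -1)   # explicit row vector, shape (1, 10)
print(col.shape)         # (10, 1)
print(row.shape)         # (1, 10)
```

Note that col and row are 2-D arrays with the same data; only the shape metadata differs.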

It's part of the "numpy philosophy", and sklearn depends on numpy.

As for your comment:

why sklearn doesn't automatically understand that if I pass it something of the shape (n,) that n_samples=n and n_features=1

sklearn cannot infer from the X data alone whether n_samples=n and n_features=1 or the other way around (n_samples=1 and n_features=n). It could be done if y were also passed, since y would make the number of samples clear.

But that would mean changing all the code that relies on the current semantics, and that could break many things, because sklearn depends heavily on numpy operations.

You may also want to check the following links where similar issues are discussed.

