I'm fairly new to Python and am building my first random forest (RF) model on some classification data. I've converted all of the labels to int64 and loaded them into X and Y as NumPy arrays, but I hit an error when I try to train the model.
Here is what my arrays look like:
>>> X = np.array([[df.tran_cityname, df.tran_signupos, df.tran_signupchannel, df.tran_vmake, df.tran_vmodel, df.tran_vyear]])
>>> Y = np.array(df['completed_trip_status'].values.tolist())
>>> X
array([[[ 1, 1, 2, 3, 1, 1, 1, 1, 1, 3, 1,
3, 1, 1, 1, 1, 2, 1, 3, 1, 3, 3,
2, 3, 3, 1, 1, 1, 1],
[ 0, 5, 5, 1, 1, 1, 2, 2, 0, 2, 2,
3, 1, 2, 5, 5, 2, 1, 2, 2, 2, 2,
2, 4, 3, 5, 1, 0, 1],
[ 2, 2, 1, 3, 3, 3, 2, 3, 3, 2, 3,
2, 3, 2, 2, 3, 2, 2, 1, 1, 2, 1,
2, 2, 1, 2, 3, 1, 1],
[ 0, 0, 0, 42, 17, 8, 42, 0, 0, 0, 22,
0, 22, 0, 0, 42, 0, 0, 0, 0, 11, 0,
0, 0, 0, 0, 28, 17, 18],
[ 0, 0, 0, 70, 291, 88, 234, 0, 0, 0, 222,
0, 222, 0, 0, 234, 0, 0, 0, 0, 89, 0,
0, 0, 0, 0, 40, 291, 131],
[ 0, 0, 0, 2016, 2016, 2006, 2014, 0, 0, 0, 2015,
0, 2015, 0, 0, 2015, 0, 0, 0, 0, 2015, 0,
0, 0, 0, 0, 2016, 2016, 2010]]])
>>> Y
array(['NO', 'NO', 'NO', 'YES', 'NO', 'NO', 'YES', 'NO', 'NO', 'NO', 'NO',
'NO', 'YES', 'NO', 'NO', 'YES', 'NO', 'NO', 'NO', 'NO', 'NO', 'NO',
'NO', 'NO', 'NO', 'NO', 'NO', 'NO', 'NO'],
dtype='|S3')
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 2039, in train_test_split
    arrays = indexable(*arrays)
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 206, in indexable
    check_consistent_length(*result)
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 181, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 29]
It would be really helpful if someone could help me understand this error and what I can do to fix it. I cannot change my data.
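One likely reading of the [1, 29] in the message: scikit-learn counts samples along the first axis of each input. The extra pair of brackets in np.array([[...]]) gives X a shape of (1, 6, 29), so it looks like 1 sample, while Y has 29. A minimal sketch of a fix, using NumPy only and three made-up columns (city, signup_os and vehicle_year are hypothetical stand-ins for the df.tran_* columns), stacks the columns so that each row is one sample:

```python
import numpy as np

# Hypothetical stand-ins for three of the df.tran_* columns (4 samples each).
city = np.array([1, 1, 2, 3])
signup_os = np.array([0, 5, 5, 1])
vehicle_year = np.array([0, 0, 0, 2016])
labels = np.array(["NO", "NO", "NO", "YES"])

# Same construction as in the question: double brackets, one row per feature.
X_wrong = np.array([[city, signup_os, vehicle_year]])
print(X_wrong.shape)  # (1, 3, 4): the first axis is 1, so "1 sample"

# One row per sample, one column per feature.
X_right = np.column_stack([city, signup_os, vehicle_year])
print(X_right.shape)  # (4, 3): the first axis now matches len(labels)
```

With the original DataFrame, selecting the columns directly, for example df[['tran_cityname', 'tran_signupos', 'tran_signupchannel', 'tran_vmake', 'tran_vmodel', 'tran_vyear']].values, should give the same (n_samples, n_features) layout without changing the data itself.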
X = train[['id', 'listing_type', 'floor', 'latitude', 'longitude',
'beds', 'baths','total_rooms','square_feet','group','grades']]
Y = test['price']
n = pd.get_dummies(train.group)
Below is what the training data looks like:
id listing_type floor latitude longitude beds baths total_rooms square_feet grades high_price_high_freq high_price_low_freq low_price
265183 10 4 40.756224 -73.962506 1 1 3 790 2 1 0 0 0
270356 10 7 40.778010 -73.962547 5 5 9 4825 2 1 0 0
176718 10 25 40.764955 -73.963483 2 2 4 1645 2 1 0 0
234589 10 5 40.741448 -73.994216 3 3 5 2989 2 1 0 0
270372 10 5 40.837000 -73.947787 1 1 3 1045 2 0 0 1
The code that raises the error is:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
The error message is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-479-ca78b7b5f096> in <module>()
1 from sklearn.cross_validation import train_test_split
----> 2 X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
3 from sklearn.linear_model import LinearRegression
4 regressor = LinearRegression()
5 regressor.fit(X_train, y_train)
~\Anaconda3\lib\site-packages\sklearn\cross_validation.py in train_test_split(*arrays, **options)
2057 if test_size is None and train_size is None:
2058 test_size = 0.25
-> 2059 arrays = indexable(*arrays)
2060 if stratify is not None:
2061 cv = StratifiedShuffleSplit(stratify, test_size=test_size,
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
227 else:
228 result.append(np.array(X))
--> 229 check_consistent_length(*result)
230 return result
231
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
202 if len(uniques) > 1:
203 raise ValueError("Found input variables with inconsistent numbers of"
--> 204 " samples: %r" % [int(l) for l in lengths])
205
206
ValueError: Found input variables with inconsistent numbers of samples: [2750, 1095]
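The [2750, 1095] in the message looks like the row counts of two different DataFrames: X is taken from train, but Y is taken from test. A minimal sketch of the idea, assuming the training frame itself has a price column (that column is not shown in the sample above, so this is an assumption) and using tiny made-up frames:

```python
import pandas as pd

# Made-up frames; only the row counts matter here (5 vs 2).
train = pd.DataFrame({
    "beds": [1, 5, 2, 3, 1],
    "baths": [1, 5, 2, 3, 1],
    "square_feet": [790, 4825, 1645, 2989, 1045],
    "price": [500, 3000, 1200, 2100, 650],  # hypothetical target column
})
test = pd.DataFrame({"beds": [2, 3], "price": [900, 1500]})

feature_cols = ["beds", "baths", "square_feet"]

# As in the question: features from train, target from test.
X = train[feature_cols]
Y_mismatched = test["price"]
print(len(X), len(Y_mismatched))  # 5 2 -> inconsistent numbers of samples

# Features and target taken from the same frame, so the rows line up.
Y = train["price"]
print(len(X), len(Y))  # 5 5
```

If the target really does live only in a separate frame, the two frames would first need to be aligned on a shared key before splitting; the sketch above does not cover that case.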
I am working with a small test dataset and getting a ValueError: Found input variables with inconsistent numbers of samples: [5, 6]. How can I make X and y the same size? I added the line:
dataset.dropna(inplace=True)
to drop NA values so that the two arrays become the same size. However, I still get the ValueError. The code is:
# Importing Libraries
import numpy as np
import pandas as pd

# Import dataset
dataset = pd.read_csv("../output.tsv", delimiter = '\t')

# library to clean data
import re

# Natural Language Tool Kit
import nltk
nltk.download('stopwords')

# to remove stopword
from nltk.corpus import stopwords

# for Stemming purpose
from nltk.stem.porter import PorterStemmer

# Initialize empty array
# to append clean text
corpus = []

# 1000 (reviews) rows to clean
for i in range(0, 5):

    # column : "Review", row ith
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

    # convert all cases to lower cases
    review = review.lower()

    # split to array(default delimiter is " ")
    review = review.split()

    # creating PorterStemmer object to
    # take main stem of each word
    ps = PorterStemmer()

    # loop for stemming each word
    # in string array at ith row
    review = [ps.stem(word) for word in review
              if not word in set(stopwords.words('english'))]

    # rejoin all string array elements
    # to create back into a string
    review = ' '.join(review)

    # append each string to create
    # array of clean text
    corpus.append(review)

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer

# To extract max 1500 feature.
# "max_features" is attribute to
# experiment with to get better results
cv = CountVectorizer(max_features = 9)

# X contains corpus (dependent variable)
X = cv.fit_transform(corpus).toarray()

# y contains answers if review
# is positive or negative
y = dataset.iloc[:, 1].values

# Splitting the dataset into
# the Training set and Test set
from sklearn.model_selection import train_test_split
dataset.dropna(inplace=True)
print(X.shape)
print(y.shape)

# experiment with "test_size"
# to get better results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
print(X_train.shape)
print(y_train.shape)
The output from the code (for X.shape and y.shape) is:
(5, 9)
(6,)
The error is: ValueError: Found input variables with inconsistent numbers of samples: [5, 6]
Please always show the complete, unmodified error traceback.
It contains valuable debugging information.
Nov-07-2019, 04:41 PM
The full output is:
Error:
(5, 9)
(6,)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-65-67c92addcc9a> in <module>
82 # experiment with "test_size"
83 # to get better results
---> 84 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
85 print(X_train.shape)
86 print(y_train.shape)
~\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(*arrays, **options)
2094 raise TypeError("Invalid parameters passed: %s" % str(options))
2095
-> 2096 arrays = indexable(*arrays)
2097
2098 n_samples = _num_samples(arrays[0])
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
228 else:
229 result.append(np.array(X))
--> 230 check_consistent_length(*result)
231 return result
232
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
203 if len(uniques) > 1:
204 raise ValueError("Found input variables with inconsistent numbers of"
--> 205 " samples: %r" % [int(l) for l in lengths])
206
207
ValueError: Found input variables with inconsistent numbers of samples: [5, 6]
Nov-07-2019, 07:43 PM
If I'm not mistaken, your X and y have different numbers of samples (5 and 6). They have to be equal, because train_test_split pairs each row of X with the corresponding entry of y. Somewhere in your preprocessing, the output ends up with fewer rows than the input.
The range was incorrect. The file had 6 reviews, but the code was:
for i in range(0, 5):
I have corrected that and the code works fine.
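To avoid the same mismatch when the file grows, the loop bound can be derived from the data rather than hard-coded. A rough sketch, with a made-up six-row frame standing in for output.tsv and a hypothetical "Liked" label column; stemming and stop-word removal are left out to keep it short:

```python
import re
import pandas as pd

# Made-up stand-in for the TSV file: 6 reviews, 6 labels.
dataset = pd.DataFrame({
    "Review": ["Great food!", "Too slow...", "Loved it",
               "Not good", "Okay, fine", "Never again"],
    "Liked": [1, 0, 1, 0, 1, 0],  # hypothetical label column
})

corpus = []
# len(dataset) follows the data, so no row is skipped.
for i in range(len(dataset)):
    # Replace every non-letter with a space, as in the original loop.
    review = re.sub("[^a-zA-Z]", " ", dataset["Review"].iloc[i])
    # Lower-case and collapse repeated whitespace.
    corpus.append(" ".join(review.lower().split()))

y = dataset.iloc[:, 1].values
print(len(corpus), len(y))  # 6 6
```

Because corpus and y are both built from every row of the same frame, the feature matrix produced from corpus has the same number of rows as y.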
I faced a similar problem while fitting a regression model. In my case, the number of rows in X was not equal to the number of rows in y. You likely get this problem because you remove rows containing nulls from X_train and y_train independently of each other. y_train probably has few or no nulls, and X_train probably has some. So when you remove a row from X_train and the same row is not removed from y_train, your data goes out of sync and the two end up with different lengths. Instead, you should remove nulls before you separate X and y.
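A minimal sketch of that ordering, with a made-up frame (the column names are hypothetical): drop the incomplete rows from the whole DataFrame first, then separate X and y, so both lose the same rows. Calling dropna on the DataFrame after X and y have already been extracted, as in the code earlier in this thread, leaves those arrays unchanged.

```python
import numpy as np
import pandas as pd

# Made-up data: one row has a missing feature value.
df = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 4.0],
    "feature_b": [10.0, 20.0, 30.0, 40.0],
    "target": [0, 1, 0, 1],
})

# Drop incomplete rows once, on the full frame.
clean = df.dropna()

# Only now separate features and target.
X = clean[["feature_a", "feature_b"]].values
y = clean["target"].values
print(X.shape, y.shape)  # (3, 2) (3,)
```

Printing X.shape and y.shape right before the split, as the original poster did, is a quick way to confirm that the first dimensions agree.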