Random Seed Chose Different Rows
I was applying .sample with random_state set to a constant and after using set_index it started selecting different rows. A member dropped that was previously included in the subse
Solution 1:
Applying .sort_index() after reading in the data and before calling .sample() corrected the issue. As long as the data remains the same, this will produce the same sample every time.
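A minimal sketch of why sorting helps, assuming the underlying problem was that the same rows arrived in a different order. The frame, the column names (member_id, score) and the seed are made up for illustration and are not from the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the same 1,000 members loaded in two different row orders.
members = pd.DataFrame({"member_id": np.arange(1000), "score": np.arange(1000) * 2})
load_a = members.set_index("member_id")
load_b = members.iloc[::-1].set_index("member_id")  # same rows, reversed order

# Without sorting, the same sampled positions land on different members.
raw_a = load_a.sample(n=5, random_state=0)
raw_b = load_b.sample(n=5, random_state=0)

# After sort_index() both frames share one row order, so the samples match.
sorted_a = load_a.sort_index().sample(n=5, random_state=0)
sorted_b = load_b.sort_index().sample(n=5, random_state=0)

print(raw_a.index.tolist() == raw_b.index.tolist())  # False
print(sorted_a.equals(sorted_b))                     # True
```

Sorting only restores reproducibility when the set of index labels is unchanged. If rows are added or removed, the positions shift and the sample can still differ.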
Solution 2:
When sampling rows without weights, the only things that matter are n (the number of rows to draw), the length of the DataFrame, and whether you sample with replacement. Together with the seed, these determine the .iloc positions to take, regardless of the data.
For rows, sampling occurs as:
axis_length = self.shape[0] # DataFrame length
rs = pd.core.common.random_state(random_state)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)  # np.random.choice
return self.take(locs, axis=axis, is_copy=False)
Just to illustrate the point
Sample Data
import pandas as pd
import numpy as np
n = 100000
np.random.seed(123)
df = pd.DataFrame({'id': list(range(n)), 'gender': np.random.choice(['M', 'F'], n)})
# The scalar 'M' is broadcast to every row; a one-element list would not match the index length.
df1 = pd.DataFrame({'id': list(range(n)), 'gender': 'M'},
                   index=np.random.choice(['foo', 'bar', np.nan], n)).assign(blah=1)
For this seed and length, sampling always chooses the row at integer position 42083, i.e. df.iloc[42083]:
df.sample(n=1, random_state=123)
#           id gender
# 42083  42083      M
df1.sample(n=1, random_state=123)
#         id gender  blah
# foo  42083      M     1
df1.reset_index().shift(10).sample(n=1, random_state=123)
#       index       id gender  blah
# 42083   nan  42073.0      M   1.0
Even numpy:
np.random.seed(123)
np.random.choice(df.shape[0], size=1, replace=False)
#array([42083])
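The same equivalence can be checked for several rows at once. This is a sketch that assumes an integer random_state is turned into a np.random.RandomState, as in the excerpt quoted earlier. The frame, its labels and the seed are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string labels, to show that labels play no role.
frame = pd.DataFrame({"value": np.arange(50)},
                     index=[f"row{i}" for i in range(50)])

# Draw positions directly from a seeded legacy RandomState ...
positions = np.random.RandomState(7).choice(len(frame), size=5, replace=False)

# ... and compare with what .sample returns for the same seed and n.
sampled = frame.sample(n=5, random_state=7)
print(sampled.equals(frame.iloc[positions]))
```

If the comparison holds, any change in which rows come back must be due to the row order or the frame length, not the seed.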