Python Generate Unique Ranges Of A Specific Length And Categorize Them
Solution 1:
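The natural way to do that would be to split the range of the data into 5 equal-width intervals, and then assign each value to the bin it falls into. Luckily, pandas allows you to do that easily with pd.cut (use pd.qcut instead if you want quantiles, i.e. bins that each hold roughly the same number of rows):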
df["Category"] = pd.cut(df.Activity, 5, labels=["a", "b", "c", "d", "e"])
The output is something like:
Activity Category
34 115 b
15 43 a
57 192 d
78 271 e
26 88 b
6 25 a
55 186 d
63 220 d
1 15 a
76 268 e
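To make the difference between equal-width and equal-count bins concrete, here is a minimal sketch. The activity values are made up for illustration (they are not the data from the question) and are deliberately skewed so that the two functions disagree.

```python
import pandas as pd

# Hypothetical activity counts, skewed by one large value
df = pd.DataFrame({"Activity": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
labels = ["a", "b", "c", "d", "e"]

# Equal-width bins: the range 1..100 is cut into 5 intervals of the same width,
# so almost every row lands in the first bin
df["by_width"] = pd.cut(df["Activity"], 5, labels=labels)

# Equal-count bins: the edges are quantiles, so each bin gets about the same
# number of rows regardless of how the values are spread
df["by_quantile"] = pd.qcut(df["Activity"], 5, labels=labels)

print(df)
```

With skewed data like this, pd.cut leaves most bins nearly empty, while pd.qcut spreads the rows evenly. Which one is appropriate depends on whether the categories should reflect the value range or the ranking of the rows.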
An alternative view - clustering
In the above method, we've split the data into 5 bins of equal width. An alternative, more sophisticated approach would be to split the data into 5 clusters and aim to have the data points in each cluster as similar to each other as possible. In machine learning, this is known as a clustering problem.
One classic clustering algorithm is k-means. It's typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.), so this one-dimensional example is a very simplistic case of clustering.
In this case, k-means clustering can be done in the following way:
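import numpy as np
import pandas as pd
from scipy.cluster.vq import vq, kmeans, whiten
# l is the list of activity counts from the question
df = pd.DataFrame({"Activity": l})
# k-means expects a 2-D array: one row per observation, one column per feature
features = np.array([[x] for x in df.Activity], dtype=float)
# scale each feature to unit variance
whitened = whiten(features)
# find 5 centroids, then assign every observation to its nearest centroid
codebook, distortion = kmeans(whitened, 5)
code, dist = vq(whitened, codebook)
df["Category"] = code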
And the output looks like:
Activity Category
40 138 1
79 272 0
72 255 0
13 38 3
41 139 1
65 231 0
26 88 2
59 197 4
76 268 0
45 145 1
A couple of notes:
- The labels of the categories are arbitrary and carry no order. In this case label '1' refers to higher activity than label '2'.
- I didn't migrate the labels from 0-4 to A-E. This can easily be done using pandas' map.
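As a rough sketch of that last step, the cluster codes can be ranked by their mean activity and then mapped to letters, so that A is the lowest group and E the highest. The data, the Cluster column and the letter_for dictionary are illustrative names made up for this example, not part of the original answer.

```python
import pandas as pd
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical activity counts forming five well-separated groups
activity = [10, 12, 14, 60, 62, 64, 120, 122, 124, 200, 202, 204, 300, 302, 304]
df = pd.DataFrame({"Activity": activity})

# Same clustering steps as above: scale, find centroids, assign codes
features = df[["Activity"]].to_numpy(dtype=float)
whitened = whiten(features)
codebook, _ = kmeans(whitened, 5)
code, _ = vq(whitened, codebook)
df["Cluster"] = code

# Rank the clusters by mean activity, then map lowest -> "A", next -> "B", ...
order = df.groupby("Cluster")["Activity"].mean().sort_values().index
letter_for = {cluster: letter for cluster, letter in zip(order, "ABCDE")}
df["Category"] = df["Cluster"].map(letter_for)

print(df.sort_values("Activity"))
```

Because k-means starts from random centroids, the numeric codes can change between runs, and it may occasionally settle on fewer than 5 clusters. Ranking by the cluster mean keeps the letters in a meaningful order either way.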
Solution 2:
Try the below solution:
df['Categ'] = pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'))
It creates a Categ column by dividing ActivityCount into 5 bins, labelled A through E.
The bin borders are set by dividing the full range of the values into 5 subranges of equal size.
You can also see the borders of each bin by calling:
pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'), retbins=True)[1]
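For a slightly more readable variant of the call above, the two return values can be unpacked by name. This is only a sketch: the sample data is made up, and categ and edges are illustrative variable names.

```python
import pandas as pd

# Hypothetical data; the column name follows the one used above
df = pd.DataFrame({"ActivityCount": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# With retbins=True, pd.cut returns a pair: the binned values and the bin edges
categ, edges = pd.cut(df.ActivityCount, bins=5, labels=list("ABCDE"), retbins=True)
df["Categ"] = categ

# 5 bins are delimited by 6 edges; the lowest edge sits slightly below the
# minimum so that the minimum value itself falls inside the first bin
print(edges)
print(df)
```

Keeping the edges around is handy if the same borders should later be applied to new data, since an explicit list of edges can be passed back to pd.cut as bins.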