
Python Generate Unique Ranges Of A Specific Length And Categorize Them

I have a dataframe column which specifies how many times a user has performed an activity, e.g.:

>>> df['ActivityCount']
Users    ActivityCount
User0    220
User1    19

Solution 1:

The natural way to do that is to split the range of the data into 5 intervals of equal width and assign each value to the bin it falls into. Luckily, pandas lets you do that easily with pd.cut:

df["Category"] = pd.cut(df.Activity, 5, labels=["a", "b", "c", "d", "e"])

The output is something like:

    Activity Category
34       115        b
15        43        a
57       192        d
78       271        e
26        88        b
6         25        a
55       186        d
63       220        d
1         15        a
76       268        e
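Note that passing an integer to pd.cut produces bins of equal width, not bins with equal numbers of rows. If quantile-based bins are what you want, pd.qcut is the counterpart. The sketch below contrasts the two on made-up, deliberately skewed data; the column names EqualWidth and Quantile and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical activity counts, skewed on purpose (not the asker's data).
df = pd.DataFrame({"Activity": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
labels = ["a", "b", "c", "d", "e"]

# pd.cut: five bins of equal width across the range 1..100.
df["EqualWidth"] = pd.cut(df["Activity"], 5, labels=labels)

# pd.qcut: five bins that each hold (roughly) the same number of rows.
df["Quantile"] = pd.qcut(df["Activity"], 5, labels=labels)

# Count how many rows land in each bin under both schemes.
width_counts = df["EqualWidth"].value_counts()
quantile_counts = df["Quantile"].value_counts()
print(width_counts)
print(quantile_counts)
```

With this sample, the equal-width bins put nine of the ten rows into "a" and the outlier into "e", while the quantile bins hold two rows each. Which behaviour is preferable depends on whether the categories should reflect absolute activity levels or a ranking of users.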

An alternative view - clustering

In the above method, we've split the data into 5 bins of equal width. An alternative, more sophisticated approach would be to split the data into 5 clusters, aiming to make the data points in each cluster as similar to each other as possible. In machine learning, this is known as a clustering problem.

One classic clustering algorithm is k-means. It's typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.), so this one-dimensional case is a very simple instance of clustering.

In this case, k-means clustering can be done in the following way:

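import numpy as np
import pandas as pd
from scipy.cluster.vq import vq, kmeans, whiten

# l is the list of activity counts (not shown here)
df = pd.DataFrame({"Activity": l})

# k-means expects a 2-D array: one row per observation, one column per feature
features = np.array([[x] for x in df.Activity])
whitened = whiten(features)                 # rescale each feature to unit variance
codebook, distortion = kmeans(whitened, 5)  # find 5 centroids
code, dist = vq(whitened, codebook)         # assign each point to its nearest centroid

df["Category"] = code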

And the output looks like:

    Activity  Category
40       138         1
79       272         0
72       255         0
13        38         3
41       139         1
65       231         0
26        88         2
59       197         4
76       268         0
45       145         1

A couple of notes:

  • The category labels are arbitrary and do not follow the order of activity. In the output above, label '1' (activity around 140) refers to higher activity than label '2' (activity 88).
  • I didn't migrate the labels from 0-4 to A-E. This can easily be done using pandas' map.
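
The relabelling mentioned in the notes could look roughly like the sketch below. It reuses the ten rows from the output above, ranks the cluster codes by mean activity, and maps them to A-E so that the letters follow the activity order. The Label column name and the choice to order by mean are assumptions, not part of the original answer.

```python
import pandas as pd

# The ten sample rows shown in the k-means output above.
df = pd.DataFrame({
    "Activity": [138, 272, 255, 38, 139, 231, 88, 197, 268, 145],
    "Category": [1, 0, 0, 3, 1, 0, 2, 4, 0, 1],
})

# Rank the cluster codes by their mean activity, lowest first.
ordered_codes = df.groupby("Category")["Activity"].mean().sort_values().index

# Build {code: letter} so that "A" is the least active cluster.
code_to_letter = dict(zip(ordered_codes, "ABCDE"))

# Translate the numeric codes with pandas' map.
df["Label"] = df["Category"].map(code_to_letter)
print(df)
```

Because k-means starts from random centroids, the numeric codes can change between runs, so the mapping is best rebuilt from the data each time rather than hard-coded.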

Solution 2:

Try the below solution:

df['Categ'] = pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'))

It creates a Categ column by dividing ActivityCount into 5 bins, labelled A through E.

The bin borders are set by dividing the full range of values into n subranges of equal width (here n = 5).

You can also see the borders of each bin by calling:

pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'), retbins=True)[1]
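
As a small illustration of what retbins=True returns, here is a sketch on made-up counts spanning 0 to 100 (the sample values are an assumption, chosen so the borders come out as round numbers). The call returns a pair: the binned column and an array of the 6 borders that delimit the 5 bins.

```python
import pandas as pd

# Hypothetical counts spanning 0..100 (not the asker's data).
df = pd.DataFrame({"ActivityCount": [0, 10, 35, 50, 75, 100]})

# retbins=True makes pd.cut return (binned values, array of bin borders).
categ, edges = pd.cut(
    df.ActivityCount, bins=5, labels=list("ABCDE"), retbins=True
)
df["Categ"] = categ

print(df)
print(edges)
```

The lowest border comes out slightly below the minimum value, because pandas extends the range a little so that the minimum itself falls inside the first bin. The remaining borders are evenly spaced.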
