91-DOU : Day #5

Course : Python for Machine Learning and Data Science Masterclass

Video Mins completed : 29 mins

Last Video # completed : 209

Hierarchical clustering : Groups points that are similar to each other, with similarity measured by a distance metric.

Can be used to figure out the number of clusters.

Types of Hierarchical Clustering:

  • Agglomerative : Starts with each point as its own cluster; the closest clusters are then merged step by step to form bigger clusters.
  • Divisive : Starts with all points in a single cluster, which is then split recursively into smaller clusters.
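A minimal sketch of the agglomerative (bottom-up) variant using SciPy, on made-up toy data (not from the course): `linkage` builds the merge tree and `fcluster` cuts it into flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy blobs (illustrative data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# Agglomerative merging: 'ward' joins the pair of clusters that
# least increases total within-cluster variance at each step.
Z = linkage(X, method="ward")

# Cut the merge tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(sorted(set(labels)))
```

Plotting `scipy.cluster.hierarchy.dendrogram(Z)` visualizes the tree, which is how hierarchical clustering helps pick the number of clusters by eye.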

Data scaling methods

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

https://medium.com/@onersarpnalcin/standardscaler-vs-minmaxscaler-vs-robustscaler-which-one-to-use-for-your-next-ml-project-ae5b44f571b9

https://medium.com/@hhuseyincosgun/which-data-scaling-technique-should-i-use-a1615292061e
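A quick comparison of the three scalers from the linked articles, on a made-up column with one outlier, showing the behavior each one guarantees:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy column with an outlier (illustrative, not from the course)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

std = StandardScaler().fit_transform(X)   # zero mean, unit variance
mm  = MinMaxScaler().fit_transform(X)     # squashed into [0, 1]
rob = RobustScaler().fit_transform(X)     # centers on median, scales by IQR

# The outlier drags StandardScaler's mean and crushes MinMaxScaler's
# range; RobustScaler keeps the median at 0 regardless.
print(np.round(std.ravel(), 2))
print(np.round(mm.ravel(), 2))
print(np.round(rob.ravel(), 2))
```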

91-DOU : Day #4

Course : Python for Machine Learning and Data Science Masterclass

Video Mins completed : 126 mins

Last Video # completed : 206

K-Means clustering – Scale the data if a mix of numeric and encoded features is present, so that the distance measure (and hence the clustering) is not dominated by features with larger ranges.

Find the correlation between the features and labels to know which features have the highest bearing on the clustering.

To figure out the ideal K value, check the SSD (sum of squared distances, exposed as the model's inertia_ attribute) for a range of K values. The K beyond which the value stops dropping significantly marks the elbow and can be taken as the cluster count.
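The elbow check above can be sketched with scikit-learn on synthetic blobs (toy data, not the course dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 4 synthetic blobs; inertia_ (SSD) should drop sharply until K=4,
# then flatten out — that flattening point is the "elbow".
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ssd = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    ssd[k] = km.inertia_  # sum of squared distances to nearest center

# Inertia keeps decreasing as K grows; look for where the drop levels off
print([round(v) for v in ssd.values()])
```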

Intro to choropleth maps, which allow clusters to be represented on a map.

https://plotly.com/python/choropleth-maps
https://medium.com/@nirmalsankalana/k-means-clustering-choosing-optimal-k-process-and-evaluation-methods-2c69377a7ee4

91-DOU : Day #3

Course : Cluster Analysis & Unsupervised ML in Python

Video Mins completed : 27 mins

Last Video # completed : 29

Notes

K-Means cost functions –

Current metric : Distance from Cluster mean or center.

  • Does not work well with default feature scales. If features have varied scales, the distance metric will vary wildly unless the data is scaled first.
  • Works with large datasets
  • Sensitive to K

Another metric : Purity

Requires labels. Such methods are called “external validation” methods. Examples:

  • Rand Measure
  • F-measure
  • Jaccard Index
  • Normalized Mutual Info

Metric on unlabeled data : Davies Bouldin Index (DBI)

Lower DBI == better

How to choose ‘K’?

The value of K beyond which there is no significant change in cost is the ideal value of K.
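Both points above can be combined into one sketch: sweep K, score each clustering with the Davies-Bouldin Index (no labels needed), and take the K with the lowest DBI. Synthetic blobs with fixed centers, invented for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# 3 clearly separated synthetic blobs (toy data)
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)  # lower == better

best_k = min(scores, key=scores.get)
print(best_k)
```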

Course : Python for Machine Learning and Data Science Masterclass

Video Mins completed : 25 mins

Last Video # completed : 197

91-DOU : Day #2

Course : Cluster Analysis & Unsupervised ML in Python

Video Mins completed : 50 mins

Last Video # completed : 22

Notes

K-Means Clustering

Cost function : Coordinate distance

Soft K-means : Assigns probability of a point belonging to a certain cluster based on the distance from the cluster mean.

Better than hard K-means, which assigns a 100% probability to a single cluster.
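A minimal sketch of the soft-assignment step, written from scratch for illustration (not the course's code): each point gets a probability per cluster via a softmax over negative squared distances, with a hypothetical sharpness parameter beta.

```python
import numpy as np

def soft_assignments(X, centers, beta=1.0):
    """Soft K-means responsibilities: rows sum to 1 across clusters."""
    # squared Euclidean distance of each point to each center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-beta * d2)          # closer center -> larger weight
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
centers = np.array([[0.0, 0.0], [5.0, 0.0]])
R = soft_assignments(X, centers)
# The point at x=1 leans toward cluster 0 but keeps nonzero
# probability for cluster 1 — unlike a hard 0/1 assignment.
print(R.round(3))
```

As beta grows, the assignments sharpen toward hard K-means.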

K-Means clustering fails for data clusters shaped as

  1. donuts (concentric rings)
  2. elongated clusters
  3. clusters of different densities.

Can only look for spherical clusters
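The donut failure mode can be demonstrated on synthetic concentric rings (toy data, not from the course): K-Means carves the plane with a straight boundary through both rings, so its labels agree with the true rings at roughly chance level.

```python
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans

# Concentric rings ("donut"): the true grouping is by ring, but
# K-Means can only find convex, roughly spherical regions.
X, ring = make_circles(n_samples=400, factor=0.3, noise=0.05,
                       random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cluster ids are arbitrary, so take the better of the two matchings;
# it stays near 50% because each half-plane holds half of each ring.
agreement = max((labels == ring).mean(), (labels != ring).mean())
print(round(agreement, 2))
```

Density-based methods (e.g. DBSCAN) or spectral clustering handle these shapes.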

Disadvantages

  1. Need to choose K
  2. Local Minima tripping the clustering
  3. Sensitive to initial configuration
  4. Doesn’t take into account the density of the cluster

91-DOU : Day #1

Course : Cluster Analysis & Unsupervised ML in Python

Video Mins completed : 125 mins

Last Video # completed : 13

Notes

Clustering application

  1. Categorization
  2. Search : Closest neighbors for an item
  3. Density estimation : Finding probability distribution in the data.

Implemented exercises to understand the core logic of K-Means clustering. This was unnecessary. Implementation could have been skipped. Need to move faster.