91-DOU : Day #5

Course : Python for Machine Learning and Data Science Masterclass

Video Mins completed : 29 mins

Last Video # completed : 209

Hierarchical clustering : Clusters points that are similar to each other, with similarity based on a distance metric.

Can be used to figure out the number of clusters.

Types of Hierarchical Clustering:

  • Agglomerative : Starts with each point as its own cluster, then merges the closest clusters to form bigger ones (a quick sketch follows this list).
  • Divisive : Starts with all points in a single cluster, then splits it into smaller clusters.
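A minimal sketch of the agglomerative variant, assuming scikit-learn, SciPy and matplotlib are available; the toy points and parameter choices are illustrative, not from the course:

import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# two obvious groups of 2-D points
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# agglomerative: every point starts as its own cluster, closest pairs merge
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
print(model.fit_predict(X))  # e.g. [1 1 1 0 0 0]

# the merge history drawn as a dendrogram helps eyeball the number of
# clusters (the "figure out the number of clusters" use noted above)
dendrogram(linkage(X, method="ward"))
plt.show()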

Data scaling methods

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

https://medium.com/@onersarpnalcin/standardscaler-vs-minmaxscaler-vs-robustscaler-which-one-to-use-for-your-next-ml-project-ae5b44f571b9

https://medium.com/@hhuseyincosgun/which-data-scaling-technique-should-i-use-a1615292061e
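A quick comparison sketch of the three scalers the links above discuss, on toy data with an outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # note the outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(type(scaler).__name__, scaler.fit_transform(X).ravel())

# StandardScaler centers on mean/std, MinMaxScaler squashes to [0, 1],
# RobustScaler uses median/IQR and so is least affected by the outlier.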

91-DOU : Day #4

Course : Python for Machine Learning and Data Science Masterclass

Video Mins completed : 126 mins

Last Video # completed : 206

K-Means clustering – Scale the data when a mix of numeric and encoded features is present, so that the distance measures (and hence the clustering) are not dominated by features with larger ranges.

Find the correlation between the features and the labels to know which features have the highest bearing on the clustering.
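A minimal sketch of both points on toy data; the feature names and parameters are illustrative, not from the course:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy numeric data standing in for a mixed numeric/encoded feature set
X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=1)

# scale first so every feature contributes comparably to the distances
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10))
labels = pipe.fit_predict(X)

# rough check: correlate each feature with the assigned cluster label
df = pd.DataFrame(X, columns=["f1", "f2"])
df["cluster"] = labels
print(df.corr()["cluster"].sort_values(ascending=False))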

To figure out the ideal K value, check the SSD (sum of squared distances), exposed as the model’s inertia_ attribute, over a range of K’s. The point beyond which increasing K no longer drops the value significantly (the “elbow”) can be taken as the cluster count.
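A minimal elbow-method sketch on toy data, assuming scikit-learn and matplotlib:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ssd = []
ks = range(2, 11)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    ssd.append(model.inertia_)  # SSD of points to their closest centroid

plt.plot(list(ks), ssd, "o-")
plt.xlabel("K")
plt.ylabel("SSD (inertia_)")
plt.show()  # look for the "elbow" where the curve flattens out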

Intro to choropleth maps, which allow clusters to be represented on a map.

https://plotly.com/python/choropleth-maps
https://medium.com/@nirmalsankalana/k-means-clustering-choosing-optimal-k-process-and-evaluation-methods-2c69377a7ee4
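A minimal Plotly choropleth sketch (per the plotly.com link above); the state codes and cluster labels here are made up for illustration:

import plotly.express as px

fig = px.choropleth(
    locations=["CA", "TX", "NY"],  # US state codes (illustrative)
    locationmode="USA-states",
    color=[0, 1, 2],               # e.g. cluster label per state
    scope="usa",
)
fig.show()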

91-DOU : Day #3

Course : Cluster Analysis & Unsupervised ML in Python

Video Mins completed : 27 mins

Last Video # completed : 29

Notes

K-Means cost functions –

Current metric : Distance from Cluster mean or center.

  • Does not work well with unscaled values. If features have varied scales, the distance metric will vary wildly, so the algorithm will not work unless the data is scaled.
  • Does scale to large datasets
  • Sensitive to K

Another metric : Purity

Requires labels. Such methods are called “external validation” methods. Examples (a quick sketch follows the list):

  • Rand Measure
  • F-measure
  • Jaccard Index
  • Normalized Mutual Info
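A minimal sketch of two of these via scikit-learn, with made-up label vectors:

from sklearn.metrics import normalized_mutual_info_score, rand_score

true_labels = [0, 0, 1, 1, 2, 2]  # ground truth (illustrative)
pred_labels = [0, 0, 1, 2, 2, 2]  # cluster assignments

print(rand_score(true_labels, pred_labels))
print(normalized_mutual_info_score(true_labels, pred_labels))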

Metric on unlabeled data : Davies Bouldin Index (DBI)

Lower DBI == better
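A minimal sketch of computing the DBI with scikit-learn on toy data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))  # lower is better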

How to choose ‘K’?

The value of K beyond which there is no significant change in cost is the ideal value of K.

Course : Python for Machine Learning and Data Science Masterclass

Video Mins completed : 25 mins

Last Video # completed : 197

ACM Talk, June 27th : From ML Engineering to AI Engineering

Points captured from the talk by Chip Huyen.

AI Engg -> the process of building apps with foundation models
Foundation models -> term coined at Stanford

AI Engg
1) Model as a service -> Anyone can use AI
2) Open-ended evaluations -> open ended responses, harder to evaluate
How to evaluate -> comparative evals (Chatbot Arena), AI-as-a-judge, 5 prompts to evaluate

Evaluation is a big challenge.

Feature engg -> context construction
Problem -> Hallucination. Models hallucinate less when given lots of context.
Retrieval (RAG) -> BM25, ElasticSearch, Dense Passage Retrieval, Vector DBs (compute intensive)
Future -> trying to build embeddings for tables
Agentic

Bigger size -> higher latency, expensive, requiring more expertise to host
Check this -> ????

Inference optimization -> hardware, algorithms, model architecture
caching
parameter-efficient fine-tuning

Topics to check

Apache Arrow
Debugging gen AI apps
Distributed systems for LLMs

Some snapshots of the presentation made by Chip.

91-DOU : Day #2

Course : Cluster Analysis & Unsupervised ML in Python

Video Mins completed : 50 mins

Last Video # completed : 22

Notes

K-Means Clustering

Cost function : squared distance of each point from its cluster center.

Soft K-means : Assigns probability of a point belonging to a certain cluster based on the distance from the cluster mean.

Better than hard K-means, which assigns a 100% probability to one cluster.
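A minimal sketch of the soft-assignment step, using the usual exponential falloff with a stiffness parameter beta (the names and values are my assumption, not from the course):

import numpy as np

def soft_assignments(X, means, beta=1.0):
    # squared distance from every point to every cluster mean
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    r = np.exp(-beta * d2)                   # unnormalized responsibilities
    return r / r.sum(axis=1, keepdims=True)  # each row sums to 1

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
print(soft_assignments(X, means))  # per-point cluster probabilities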

K-Means clustering fails for data clusters shaped as

  1. donut
  2. elongated clusters
  3. different density clusters.

Can only look for spherical clusters (the donut case is sketched below).
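A minimal sketch of the donut failure case, using scikit-learn's make_circles toy generator:

from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# two concentric rings; the "true" clusters are the inner and outer ring
X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# pred cuts the plane roughly in half instead of separating the rings,
# since K-Means can only carve out convex, roughly spherical clusters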

Disadvantages

  1. Need to choose K
  2. Local minima can trap the clustering
  3. Sensitive to initial configuration
  4. Doesn’t take into account the density of the cluster

91-DOU : Day #1

Course : Cluster Analysis & Unsupervised ML in Python

Video Mins completed : 125 mins

Last Video # completed : 13

Notes

Clustering application

  1. Categorization
  2. Search : Closest neighbors for an item
  3. Density estimation : Finding probability distribution in the data.
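For the density-estimation application, a minimal sketch with a Gaussian mixture from scikit-learn (toy data, illustrative parameters):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).normal(size=(200, 1))
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.score_samples(X[:3]))  # log p(x) under the fitted density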

Implemented exercises to understand the core logic of K-Means clustering. This was unnecessary. Implementation could have been skipped. Need to move faster.

Codewars kata : Sum of pairs

This simple problem of finding a pair of numbers in a list of random integers that adds up to a specific number is easy to get working, but hard to optimize. So we will go through the various implementation strategies and see the solution metamorphose into the final accepted version.

Implementation iterations

1. Maintain a cache of each left number seen. If a number has not produced a pair in the past, ignore it. Also maintain an array of (left, right) index pairs which have generated a result. This got the basic tests running, but the final submission timed out.

2. Maintain no cache of previously seen numbers; instead keep a tuple of (left index, right index, gap between the two indices), overwriting the old entry only when a new number yields a smaller gap. It iterates through the whole list, hence not a great idea.

3. No cache needed; search only up to index i + gap, as anything further would give a pair that is farther apart. Again, since all the indices are being searched, this is not a runtime-efficient option.

def sum_pairs(ints, s):
    # nearest_pair holds (start idx, end idx, gap between the indices)
    nearest_pair = ()
    ints_len = len(ints)
    for i in range(ints_len):
        diff_s = s - ints[i]
        try:
            # only search up to the current best gap; anything beyond
            # would yield a pair that is farther apart
            end_idx = ints_len
            if nearest_pair and nearest_pair[2] < ints_len:
                end_idx = i + nearest_pair[2]
            idx = ints.index(diff_s, i + 1, end_idx)
            gap = idx - i
            if not nearest_pair or nearest_pair[2] > gap:
                nearest_pair = (i, idx, gap)
        except ValueError:
            # complement not found in the window; move on
            pass
        if nearest_pair and nearest_pair[2] == 1:
            break  # adjacent pair; cannot do better

    if not nearest_pair:
        return None
    return [ints[nearest_pair[0]], ints[nearest_pair[1]]]

4. We need to find only the first two candidate pairs, and then just return the one that completes earlier. In the worst case it’s O(n^2), but in the best case it’s only O(2n). But the Codewars bot says it still runs too long.

def sum_pairs(ints, s):
    # tuples_list holds up to two (start idx, end idx) candidate pairs
    tuples_list = []
    ints_len = len(ints)
    for i in range(ints_len):
        diff_s = s - ints[i]
        # only search up to the end index of the first candidate; any
        # match beyond it would complete later anyway
        end_idx = ints_len
        if tuples_list:
            end_idx = tuples_list[0][1]
        for z in range(i + 1, end_idx):
            if ints[z] == diff_s:
                tuples_list.append((i, z))
                break
        if len(tuples_list) == 2:
            # return whichever pair completes earlier
            if tuples_list[0][1] > tuples_list[1][1]:
                return [ints[tuples_list[1][0]], ints[tuples_list[1][1]]]
            return [ints[tuples_list[0][0]], ints[tuples_list[0][1]]]

    if not tuples_list:
        return None
    # only one pair was found in the whole list
    return [ints[tuples_list[0][0]], ints[tuples_list[0][1]]]

5. My accepted solution #1: This uses a cache to note numbers that have not generated any result, which can then be used to skip iterations. In the worst case we still have O(n^2), with an additional n entries needed for storage. Not too elegant; lots of verbose coding.

def sum_pairs(ints, s):
    # tuples_list holds up to two (start idx, end idx) candidate pairs;
    # nums_cache holds numbers already known to produce no pair
    tuples_list = []
    nums_cache = []
    ints_len = len(ints)
    for i in range(ints_len):
        if ints[i] in nums_cache:
            continue  # this number never pairs up; skip it
        diff_s = s - ints[i]
        # only search up to the end index of the first candidate
        end_idx = ints_len
        if tuples_list:
            end_idx = tuples_list[0][1]
        num_seen = False
        for z in range(i + 1, end_idx):
            if ints[z] == diff_s:
                tuples_list.append((i, z))
                num_seen = True
                break
        if not num_seen:
            nums_cache.append(ints[i])
            continue
        if len(tuples_list) == 2:
            # return whichever pair completes earlier
            fe = tuples_list[0]
            se = tuples_list[1]
            if fe[1] > se[1]:
                return [ints[se[0]], ints[se[1]]]
            return [ints[fe[0]], ints[fe[1]]]

    if not tuples_list:
        return None
    if len(tuples_list) == 1:
        return [ints[tuples_list[0][0]], ints[tuples_list[0][1]]]

6. My accepted solution #2: I got good use out of for-else and was able to shave off a few more milliseconds. This version adopts a more Pythonic style.

def sum_pairs(ints, s):
    # tuples_list holds up to two (start idx, end idx) candidate pairs;
    # nums_cache holds numbers already known to produce no pair
    tuples_list = []
    nums_cache = []
    ints_len = len(ints)
    for i in range(ints_len):
        if ints[i] in nums_cache:
            continue
        diff_s = s - ints[i]
        end_idx = ints_len if not tuples_list else tuples_list[0][1]
        for z in range(i + 1, end_idx):
            if ints[z] == diff_s:
                tuples_list.append((i, z))
                break
        else:
            # inner loop didn't break: no partner found, cache the number
            nums_cache.append(ints[i])

        if len(tuples_list) == 2:
            # return whichever pair completes earlier
            fe = tuples_list[0]
            se = tuples_list[1]
            if fe[1] > se[1]:
                return [ints[se[0]], ints[se[1]]]
            return [ints[fe[0]], ints[fe[1]]]

    if not tuples_list:
        return None
    if len(tuples_list) == 1:
        return [ints[tuples_list[0][0]], ints[tuples_list[0][1]]]

        

7. Optimal solution (not mine): This simple solution shines, as the number of lines is 1/5th of my implementation while it still maintains a cache and finds the pair. Elegant and truly optimal.

def sum_pairs(ints, s):
    # seen is a set of numbers encountered so far; O(1) membership tests
    seen = set()
    for v in ints:
        if (s - v) in seen:
            # v completes the earliest-finishing pair
            return [s - v, v]
        seen.add(v)
    return None  # no pair adds up to s

Lessons learnt

The optimal solution here is implemented in 1/5th of the lines and takes 1/3rd of the running time. Some insights gathered while figuring out the solution:

  1. The optimal solution focuses only on finding the numbers that add up to the required target sum. It does not track indexes to find the pair with the smallest gap, which eliminates all the caching, comparing, and returning of the correct indexes.
  2. A set’s add() can be marginally slower than a list’s append(), but membership tests (in) are O(1) on a set versus O(n) on a list. Since lookups dominate here, the optimal solution’s set-based cache wins.
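A quick, illustrative timing sketch behind that point:

import timeit

n = 10_000
xs_list = list(range(n))
xs_set = set(xs_list)

# looking up the last element: the list scans all n entries, the set hashes
print(timeit.timeit(lambda: (n - 1) in xs_list, number=1_000))
print(timeit.timeit(lambda: (n - 1) in xs_set, number=1_000))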