Notes on some things I did that might be of interest to others.

Using published results on the performance of state-of-the-art AI
document-vector embeddings and semantic hashing, I evaluated a canonical
document-vector retrieval system boosted by approximate nearest-neighbour
search.

The cosine measure is the prevailing similarity function for the document-vector model of IR. We discuss its
connection to the *intrinsic dimension*.
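For reference, the cosine measure between two document vectors $x, y \in \mathbb{R}^d$ is

$$\cos(x, y) \;=\; \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \;=\; \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2}\,\sqrt{\sum_{i=1}^{d} y_i^2}}.$$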

Some unix shell commands to extract a plain-text
list of the web sites of publishers represented by VG
Medien
on behalf of the *Leistungsschutzrecht für
Presseverleger*; see the Wikipedia entry "Ancillary copyright for press
publishers".

How many servers does Google need for its web search? How many pages
are crawled and indexed? Starting from Google's 2009 statement that it
uses *1 kJ of energy per search*, we estimate that Google used $\approx$
130,000 servers for its search in 2008. We also speculate that Google
only indexes 5% of its crawled pages.
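One way to arrive at a figure of that order is a back-of-the-envelope calculation: the 1 kJ per search figure fixes Google's total search power draw once a query rate is assumed, and dividing by an assumed per-server power gives a server count. The query volume ($\approx$ 2.8 billion searches per day) and per-server draw (250 W) below are illustrative assumptions, not Google's numbers.

```python
# Back-of-the-envelope estimate of Google's 2008 server count from the
# "1 kJ per search" figure. The searches/day and watts/server values
# are assumptions for illustration, not published numbers.

ENERGY_PER_SEARCH_J = 1_000    # Google's 2009 statement: 1 kJ per search
SEARCHES_PER_DAY = 2.8e9       # assumed query volume for 2008
WATTS_PER_SERVER = 250         # assumed average power draw per server

SECONDS_PER_DAY = 24 * 60 * 60

# Power needed to sustain the query stream: energy rate in J/s, i.e. watts.
total_power_w = SEARCHES_PER_DAY * ENERGY_PER_SEARCH_J / SECONDS_PER_DAY

servers = total_power_w / WATTS_PER_SERVER
print(f"{servers:,.0f} servers")   # on the order of 130,000
```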

Hierarchical agglomerative clustering (HAC) is a family of
algorithms for grouping data. HAC starts by merging the two
data points with the smallest distance into a new cluster and finishes with
one big cluster containing all the data.
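As a minimal sketch (not code from the note itself), single-linkage HAC on 1-D points can be written directly from that description: repeatedly find the two clusters with the smallest distance and merge them until one cluster remains.

```python
# Minimal single-linkage hierarchical agglomerative clustering (HAC)
# on 1-D points; a sketch for illustration, not an optimised implementation.

def hac(points):
    """Merge the two closest clusters until one remains.
    Returns the merge history as (cluster_a, cluster_b) pairs."""
    clusters = [[p] for p in points]   # start: every point is its own cluster
    history = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

merges = hac([1.0, 1.1, 5.0, 5.2, 9.0])
print(merges[0])   # the two closest points are merged first: ([1.0], [1.1])
```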

How many pages does Microsoft's search engine Bing.com hold in its index?
Following the idea of Maurice de Kunder,
we can roughly estimate the size of Bing's index at 300 million pages.
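De Kunder's idea extrapolates index size from reported result counts: if a word occurs in a known fraction of documents in a reference corpus, the number of hits the engine reports for that word scales up to the whole index. A sketch, with made-up document frequencies and hit counts purely for illustration:

```python
# Sketch of a de Kunder-style index size estimate: extrapolate from
# reported hit counts using document frequencies taken from a reference
# corpus. All numbers below are made up for illustration.

# fraction of documents in a reference corpus containing each word
corpus_doc_frequency = {"the": 0.60, "and": 0.55, "house": 0.05}

# hit counts the search engine reports for the same words (illustrative)
reported_hits = {"the": 180e6, "and": 168e6, "house": 14e6}

# each word yields an independent estimate: hits / document frequency
estimates = [reported_hits[w] / corpus_doc_frequency[w]
             for w in corpus_doc_frequency]

# average the per-word estimates for a rough overall index size
index_size = sum(estimates) / len(estimates)
print(f"estimated index size: {index_size:,.0f} pages")
```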

Initialization can have a big impact on the performance of the
k-means clustering
algorithm. Straightforward random initialization can require many more
iterations than a better initialization such as k-means++.
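The k-means++ seeding step (the standard algorithm, sketched here rather than taken from the note) picks the first centre uniformly at random and each further centre with probability proportional to its squared distance from the nearest centre chosen so far, which tends to spread the initial centres across the data:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding on 1-D points: pick each new centre with
    probability proportional to the squared distance from the
    nearest centre already chosen."""
    rng = random.Random(seed)
    centres = [rng.choice(points)]   # first centre: uniform at random
    while len(centres) < k:
        # squared distance of every point to its nearest chosen centre
        d2 = [min((p - c) ** 2 for c in centres) for p in points]
        # sample the next centre with probability proportional to d2
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(p)
                break
    return centres

points = [0.0, 0.1, 0.2, 10.0, 10.1, 20.0]
print(kmeans_pp_init(points, 3))   # centres tend to land in separate groups
```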