The tf-idf statistic (“term frequency – inverse document frequency”) is a common tool for extracting keywords from a document by taking not just that single document but all documents of the corpus into account. In terms of tf-idf, a word is important for a specific document if it shows up relatively often within that document and rarely in the other documents of the corpus. I used tf-idf for extracting keywords from protocols of sessions of the German Bundestag and am quite happy with the results. Given that I was dealing with (so far) 18 documents containing more than one million words in total – which had to be aggregated into term frequencies, outer-joined and then fed to the formula – I was at first a bit worried about how R would perform. To my surprise, the whole processing, from reading the files from disk to the final table of tf-idf values, took about 8 seconds. That’s not bad at all.

# The tf part

The tf-formula is the ratio of a term’s occurrences in a document to the number of occurrences of the most frequent word within the same document. Why this makes sense is pretty self-explanatory. But obviously we would end up with stop words yielding high scores – and even if those were discarded beforehand, a lot of words naturally show up often in a long text without being relevant to that specific document.
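As a minimal sketch (using a made-up miniature document), the tf-part is just a term’s count scaled by the count of the document’s most frequent word:

```r
# tf-part for one hypothetical miniature document:
# a term's count divided by the count of the most frequent word
words <- c("die", "die", "die", "agrarpolitik", "agrarpolitik", "und")
ft <- table(words)    # raw term frequencies
tf <- ft / max(ft)    # scale by the top word's count ("die", 3 times)
tf[["agrarpolitik"]]  # 2/3
```

As the example shows, the stop word “die” automatically receives the maximum tf of 1 – which is exactly why the idf-part is needed.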

# The idf part

And this is exactly where the idf-factor comes into play: it represents the inverse of the share of documents in which the regarded term can be found. The lower the number of containing documents relative to the size of the corpus, the higher the factor. The reason why this ratio is not used directly but rather its logarithm is that otherwise the effective scoring penalty for showing up in two documents would be too extreme. As you can see in the plot, the un-logged idf for a term found in just one document is twice the idf for a term found in two. This would heavily bias the ranking in favor of super-rare words even if the tf-factor indicates high relevance – and it is very unlikely that a word is highly relevant in one document but never used anywhere else.
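To put numbers on that argument (assuming a corpus of N = 20 documents): the raw ratio halves from one containing document to two, while its logarithm drops far more gently.

```r
# idf with and without the log for a corpus of N = 20 documents
N <- 20
ndocs <- 1:3
N / ndocs       # raw ratio: 20.00 10.00 6.67 - halves from 1 to 2 documents
log(N / ndocs)  # logged:     3.00  2.30 1.90 - a much milder penalty
```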

```r
# R code for above chart
library(ggplot2)

df <- data.frame(
  x   = c(1:20, 1:20),
  y   = c(1/1:20, log(20/1:20)/log(20)),
  idf = c(rep("1/n", 20), rep("log", 20))
)

ggplot(data = df, aes(x, y)) +
  geom_line(aes(color = idf)) +
  geom_point(aes(color = idf)) +
  scale_x_continuous(breaks = 1:20) +
  labs(title = "comparison of relative impact of idf-formula (scaled to 1) if term occurs in more or less documents") +
  xlab("number of documents a term is contained in") +
  ylab("") +
  theme(axis.title.x = element_text(color = "grey20"))
```

# tf and idf together

Both factors react positively to higher relevance – one using local information, the other taking a more global perspective. So we can simply take their product and arrive at the “traditional” tf-idf-formula. But this formula is not set in stone; depending on whether you want to put the emphasis on the tf- or rather on the idf-part, it makes sense to get a feeling for the ranking behaviour in your use case and adjust the formula until you are d’accord with the result.

For example, in the case of the Bundestag protocols I came to the conclusion that the final rankings are more useful if I put some more weight on the idf-part – which effectively penalizes a word’s ranking if the word shows up in more than one document. This seems to make sense right now as my corpus holds only 18 documents. So what I did is add an extra summand of idf divided by the size of the corpus – see tfidf’ in the top image. That has the effect that the penalization of non-exclusivity decreases as the corpus grows in size (more protocols being published).
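A quick way to see this fading effect (the values ft = 20 and max_ft = 4000 are just example figures): the relative weight of the constant 1/N-summand within the tf-like factor of tfidf’ shrinks as the corpus size N grows.

```r
# share of the constant 1/N-summand in the tf-like factor of tfidf'
# for ft = 20 and max_ft = 4000 - it shrinks as the corpus grows
sapply(c(18, 50, 200), function(N) (1/N) / (1/N + 20/4000))
```

For a corpus of 18 documents the 1/N-summand dominates the factor almost completely; at 200 documents it only accounts for half of it.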

```r
# calculates tf-idf for different parameters and using
# different tf-idf-versions
tab_tfidf <- function(ncorpus = 20) {
  # I assume a maximum word frequency of 4000
  max_ft <- 4000

  # tf-idf without log
  tfidf0 <- function(ft, max_ft, ndocs, ncorpus)
    (ft/max_ft) * (ncorpus/ndocs)

  # traditional tf-idf
  tfidf1 <- function(ft, max_ft, ndocs, ncorpus)
    (ft/max_ft) * log(ncorpus/ndocs)

  # tf-idf with added idf/N
  tfidf2 <- function(ft, max_ft, ndocs, ncorpus)
    (1/ncorpus + ft/max_ft) * log(ncorpus/ndocs)

  # ft = frequency of term / ndocs = how often it showed up in other documents
  df <- expand.grid(ft = c(5, 10, 20, 30), ndocs = c(1, 2, 3, 5, 10))

  res0 <- apply(df, 1, function(r) tfidf0(r["ft"], max_ft, r["ndocs"], ncorpus))
  ranks0 <- order(order(-res0))
  res1 <- apply(df, 1, function(r) tfidf1(r["ft"], max_ft, r["ndocs"], ncorpus))
  ranks1 <- order(order(-res1))
  res2 <- apply(df, 1, function(r) tfidf2(r["ft"], max_ft, r["ndocs"], ncorpus))
  ranks2 <- order(order(-res2))

  result <- cbind(df, res0, res1, res2, ranks0, ranks1, ranks2)
  result <- result[order(result$ft), ]

  return(list("ncorpus" = ncorpus, "max_ft" = max_ft, result))
}

# tf-idf for combinations of term frequency in {10,20,30} and
# occurrences in {1,2,3} relative to (20, 2)
get_change_matrix <- function(res, colname) {
  m <- matrix(res[res$ft %in% c(10, 20, 30) & res$ndocs %in% 1:3, colname],
              ncol = 3)
  # number of documents where the word is assumed to be present
  rownames(m) <- as.character(1:3)
  # number of occurrences within the hypothetical document
  colnames(m) <- as.character(1:3 * 10)
  # (A-B)/B
  m <- round((m - m[2, 2]) / m[2, 2], 2)
  return(m)
}
```

On that note – let’s see how the two versions of tf-idf compare – assuming a corpus containing 20 documents and the most frequent word to show up 4000 times.

```r
> res <- tab_tfidf()

# tfidf (traditional)
> get_change_matrix(res[[3]], "res1")
     10    20    30
1 -0.35  0.30  0.95
2 -0.50  0.00  0.50
3 -0.59 -0.18  0.24

# tfidf' (my flavor)
> get_change_matrix(res[[3]], "res2")
     10    20    30
1  0.24  0.30  0.36
2 -0.05  0.00  0.05
3 -0.21 -0.18 -0.14
```

Let’s for example compare a word A occurring 20 times in document d and found in 2 documents of the corpus with another word B occurring 10 times in document d and found in only 1 document (d, of course) of the corpus. With the traditional formula, B’s tf-idf statistic would be 35% lower than A’s. With the altered tf-idf-formula, on the other hand, B’s tf-idf statistic is 24% higher than A’s! So it makes sense to think a bit about how exactly one would like the score to behave and rank words.
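The A-versus-B comparison can be verified directly with the two formulas (assuming, as above, max_ft = 4000 and a corpus of 20 documents):

```r
# traditional tf-idf and the tfidf'-variant
tfidf1 <- function(ft, max_ft, ndocs, N) (ft/max_ft) * log(N/ndocs)
tfidf2 <- function(ft, max_ft, ndocs, N) (1/N + ft/max_ft) * log(N/ndocs)

# word A: ft = 20, in 2 documents / word B: ft = 10, in 1 document
A1 <- tfidf1(20, 4000, 2, 20); B1 <- tfidf1(10, 4000, 1, 20)
round((B1 - A1) / A1, 2)  # -0.35: B scores 35% lower traditionally

A2 <- tfidf2(20, 4000, 2, 20); B2 <- tfidf2(10, 4000, 1, 20)
round((B2 - A2) / A2, 2)  # 0.24: B scores 24% higher with tfidf'
```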

I didn’t want to spend too much time on this sub-project, so I kept it at my rule-of-thumb optimization. But once you start thinking about the tf-idf idea, a lot of possible tweaks come to mind. To give an example, true to the motto “once won’t do any harm”, we could restrict the denominator of the idf to counting only documents in which the term shows up at least twice.
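That tweak is a one-liner given a document-term matrix of counts. The matrix below is purely hypothetical data for illustration; the pmax-guard is my addition to avoid a zero denominator (and thus log(N/0)) for terms that never occur twice anywhere.

```r
# "at least twice" tweak for the idf-denominator:
# count only documents in which the term occurs two or more times.
# 'counts' is a hypothetical document-term matrix (rows = documents)
counts <- matrix(c(5, 1, 0,
                   2, 1, 0,
                   0, 1, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("agrar", "und", "justiz")))

ndocs_strict <- pmax(1, colSums(counts >= 2))  # guard against log(N/0)
ndocs_strict  # agrar: 2, und: 1, justiz: 1
```

Note how “und” appears in all three documents but only once in each, so under this rule it is treated like a term found in a single document.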

# Constraints

Obviously the tf-idf approach to quantifying a keyword’s relevance has its limits. For example, it is easily conceivable that a word shows up in a lot of documents of the corpus and yet plays a central role in each of them. Or a subject is covered in several documents **because it is** very important – yet tf-idf would penalize terms typical for this subject for exactly that reason. This is why tf-idf is most certainly not the answer to everything. I came across another idea, described in a paper from 2009, where the density of a term is used to infer its relevance. The basic idea is that a very relevant word will show relatively strong local densities compared to a common word with a more uniform density. Below you see a density approximation for three stop words (“and”, “the” (male) and “the” (female)) and the densities for the three terms that scored highest with respect to tf-idf in protocol #11.

```r
# vector of words from protocol #11 - length: 127375 words
# wv <- c("Inhaltsverzeichnis", "Plenarprotokoll", "Deutscher", ...)

plot(density(which(wv == "und"), bw = "SJ", n = 1000, adjust = .5,
             from = 0, to = length(wv)),
     main = "black: 'und', blue: 'die', green: 'der'",
     xlab = "index of word")
lines(density(which(wv == "die"), bw = "SJ", n = 1000, adjust = .5,
              from = 0, to = length(wv)), col = "blue")
lines(density(which(wv == "der"), bw = "SJ", n = 1000, adjust = .5,
              from = 0, to = length(wv)), col = "green")

plot(density(which(wv == "Tierhaltung"), bw = "SJ", n = 1000, adjust = .5,
             from = 0, to = length(wv)),
     main = "black: 'Tierhaltung', blue: 'Verbraucherpolitik', green: 'Rechtspolitik'",
     xlab = "index of word")
lines(density(which(wv == "Verbraucherpolitik"), bw = "SJ", n = 1000,
              adjust = .5, from = 0, to = length(wv)), col = "blue")
lines(density(which(wv == "Rechtspolitik"), bw = "SJ", n = 1000,
              adjust = .5, from = 0, to = length(wv)), col = "green")
```

If it is possible to take the distribution into account, then we can even determine relevance for a single document in isolation.
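One hedged way to distill the density idea into a single-document score (a sketch of mine, not the paper’s actual method) is to measure how unevenly a term’s occurrence positions are spread, e.g. via the coefficient of variation of the gaps between consecutive occurrences – roughly 1 for an evenly spread stop word, clearly above 1 for a locally concentrated term:

```r
# "clumpiness" of a term: coefficient of variation of the gaps
# between its occurrence positions within a document of given length
clumpiness <- function(positions, doclen) {
  gaps <- diff(c(0, sort(positions), doclen))
  sd(gaps) / mean(gaps)  # ~1 if evenly spread, larger if bursty
}

set.seed(1)
spread_term <- sample(1:120000, 50)  # stop-word-like, uniform positions
bursty_term <- c(sample(100:600, 25), sample(80000:80500, 25))  # two clusters

clumpiness(bursty_term, 120000) > clumpiness(spread_term, 120000)  # TRUE
```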

(original article published on www.joyofdata.de)
