scikit-learn - Adding New Text to Sklearn TF-IDF Vectorizer (Python)

- May 15, 2013

Is there a function to add to the existing corpus? I've already generated my matrix, and I'm looking to periodically add to the table without re-crunching the whole sha-bang.

E.g.:

    from sklearn.feature_extraction.text import TfidfVectorizer

    articlelist = ['here text blah blah', 'another text object', 'more foo bar right now']

    # prep_text and tokenize_text are user-defined callables (not shown)
    tfidf_vectorizer = TfidfVectorizer(
        max_df=.8,
        max_features=2000,
        min_df=.05,
        preprocessor=prep_text,
        use_idf=True,
        tokenizer=tokenize_text
    )
    tfidf_matrix = tfidf_vectorizer.fit_transform(articlelist)

    #### Adding a new article to the existing set?
    bigger_tfidf_matrix = tfidf_vectorizer.fit_transform(['the last article wanted add'])

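You can access the vocabulary_ attribute of the vectoriser directly, and you can access the idf_ vector via _tfidf._idf_diag, so it would be possible to monkey-patch something like this (note that _tfidf._idf_diag is a private attribute, so this may break between scikit-learn versions):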

    import re

    import numpy as np
    from scipy.sparse import dia_matrix
    from sklearn.feature_extraction.text import TfidfVectorizer

    def partial_fit(self, X):
        max_idx = max(self.vocabulary_.values())
        for a in X:
            # update vocabulary_
            if self.lowercase:
                a = a.lower()
            tokens = re.findall(self.token_pattern, a)
            for w in tokens:
                if w not in self.vocabulary_:
                    max_idx += 1
                    self.vocabulary_[w] = max_idx

            # update idf_: recover the document frequencies from the current idf_
            df = (self.n_docs + self.smooth_idf) / np.exp(self.idf_ - 1) - self.smooth_idf
            self.n_docs += 1
            df.resize(len(self.vocabulary_))  # new terms start with a count of zero
            for w in set(tokens):  # count each term once per document
                df[self.vocabulary_[w]] += 1
            idf = np.log((self.n_docs + self.smooth_idf) / (df + self.smooth_idf)) + 1
            self._tfidf._idf_diag = dia_matrix((idf, 0), shape=(len(idf), len(idf)))

    TfidfVectorizer.partial_fit = partial_fit
    vec = TfidfVectorizer()
    vec.fit(articlelist)
    vec.n_docs = len(articlelist)  # the vectoriser does not store this itself
    vec.partial_fit(['the last text wanted add'])
    vec.transform(['the last text wanted add']).toarray()

    # Output reported in the original answer:
    # array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
    #          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
    #          0.        ,  0.        ,  0.27448674,  0.        ,  0.43003652,
    #          0.43003652,  0.43003652,  0.43003652,  0.43003652]])
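The idf_ update in the monkey-patch works because the smoothed IDF formula can be inverted: with smoothing, idf = ln((1 + n) / (1 + df)) + 1, so df = (1 + n) / exp(idf - 1) - 1. The following is a minimal standard-library sketch of that round trip, assuming smooth_idf is on (smoothing term of 1). The function names and the toy counts are hypothetical, chosen only for illustration; this is not scikit-learn's API.

```python
import math

def idf_from_df(df, n_docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def df_from_idf(idf, n_docs):
    """Invert the smoothed IDF to recover the document frequency."""
    return (1 + n_docs) / math.exp(idf - 1) - 1

def add_document(idf_by_term, n_docs, tokens):
    """Return (updated idf per term, updated n_docs) after adding one document."""
    # Recover per-term document counts from the stored IDF weights.
    df = {term: df_from_idf(w, n_docs) for term, w in idf_by_term.items()}
    # Each distinct term counts once per document, however often it repeats.
    for term in set(tokens):
        df[term] = df.get(term, 0.0) + 1.0
    n_docs += 1
    return {term: idf_from_df(c, n_docs) for term, c in df.items()}, n_docs

# Hypothetical toy state: 3 documents, "text" appears in 2 of them, "foo" in 1.
idf = {"text": idf_from_df(2, 3), "foo": idf_from_df(1, 3)}
updated, n = add_document(idf, 3, ["text", "last", "last"])
```

Because only the IDF weights and the document count are stored, the existing documents never have to be re-read; the trade-off is that rows of a previously computed TF-IDF matrix still carry the old weights until they are transformed again.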
