scikit learn - Adding New Text to Sklearn TFIDF Vectorizer (Python)
Is there a function to add to the existing corpus? I've already generated my matrix, and I'm looking to periodically add to the table without re-crunching the whole sha-bang.
e.g.:
    from sklearn.feature_extraction.text import TfidfVectorizer

    articlelist = ['here is some text blah blah', 'another text object', 'more foo for your bar right now']
    tfidf_vectorizer = TfidfVectorizer(
        max_df=.8,
        max_features=2000,
        min_df=.05,
        preprocessor=prep_text,    # my own preprocessing function
        use_idf=True,
        tokenizer=tokenize_text    # my own tokenizer
    )
    tfidf_matrix = tfidf_vectorizer.fit_transform(articlelist)

    #### Adding a new article to the existing set?
    bigger_tfidf_matrix = tfidf_vectorizer.fit_transform(['the last article I wanted to add'])
You can access the vocabulary_ attribute of the vectoriser directly, and you can access the idf_ vector via _tfidf._idf_diag, so it would be possible to monkey-patch something like this:
    import re
    import numpy as np
    from scipy.sparse import dia_matrix
    from sklearn.feature_extraction.text import TfidfVectorizer

    def partial_fit(self, X):
        max_idx = max(self.vocabulary_.values())
        for a in X:
            # update vocabulary_
            if self.lowercase:
                a = a.lower()
            tokens = re.findall(self.token_pattern, a)
            for w in tokens:
                if w not in self.vocabulary_:
                    max_idx += 1
                    self.vocabulary_[w] = max_idx

            # update idf_: recover document frequencies from the current idf_
            df = (self.n_docs + self.smooth_idf)/np.exp(self.idf_ - 1) - self.smooth_idf
            self.n_docs += 1
            df.resize(len(self.vocabulary_))
            for w in set(tokens):  # count each term once per document
                df[self.vocabulary_[w]] += 1
            idf = np.log((self.n_docs + self.smooth_idf)/(df + self.smooth_idf)) + 1
            # _tfidf._idf_diag is a private attribute and may differ between versions
            self._tfidf._idf_diag = dia_matrix((idf, 0), shape=(len(idf), len(idf)))

    TfidfVectorizer.partial_fit = partial_fit
    vec = TfidfVectorizer()
    vec.fit(articlelist)
    vec.n_docs = len(articlelist)
    vec.partial_fit(['the last text I wanted to add'])
    vec.transform(['the last text I wanted to add']).toarray()

    # array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
    #          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
    #          0.        ,  0.        ,  0.27448674,  0.        ,  0.43003652,
    #          0.43003652,  0.43003652,  0.43003652,  0.43003652]])
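The idf update in the patch above can also be seen in isolation. What follows is only a minimal sketch, not part of scikit-learn: IncrementalIdf is a hypothetical helper that keeps raw document frequencies and a document count, and applies the same smoothed formula as the patch (with smooth_idf=True), idf = ln((1 + n_docs) / (1 + df)) + 1. It assumes lowercasing and the default token pattern, and it ignores max_df, min_df and max_features.

```python
import math
import re
from collections import Counter

# Assumption: same pattern as TfidfVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


class IncrementalIdf:
    """Hypothetical helper: stores document frequencies so idf can be updated per document."""

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()

    def partial_fit(self, docs):
        for doc in docs:
            # a term counts once per document, however often it occurs
            tokens = set(TOKEN_PATTERN.findall(doc.lower()))
            self.doc_freq.update(tokens)
            self.n_docs += 1
        return self

    def idf(self, term):
        # smoothed idf, as in the patch above with smooth_idf=True
        return math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1


model = IncrementalIdf().partial_fit(
    ["here is some text blah blah", "another text object", "more foo for your bar right now"]
)
before = model.idf("text")   # 'text' appears in 2 of 3 documents
model.partial_fit(["the last text I wanted to add"])
after = model.idf("text")    # now in 3 of 4 documents, so the idf drops
```

Keeping the counts around avoids having to invert idf_ back into document frequencies on every update, at the cost of storing one extra integer per term.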