Document Classification With Solr Streaming Expressions

November 8, 2019

Classification is one of the most popular tasks in Natural Language Processing and Machine Learning. Solr ships with a subset of Streaming Expressions that allows statistical classification models to be built and deployed out of the box. With adequate preprocessing and indexing tweaks, these features can be used to classify documents quickly and with high accuracy. This post illustrates how Solr streaming expressions and Zeppelin notebooks can be used to build a document classifier.

Dataset and Preprocessing

In this post, the BBC News dataset will be used. It will be split into training and testing subsets of 1,999 and 226 documents respectively, indexed into two collections: bbc-text-train and bbc-text-test. Since Solr's training feature, namely the train expression, takes binary-labeled documents, the labels must be binarized before indexing. The following Python code can be used to achieve that:

%python

import pandas as pd
from sklearn import preprocessing

# Binarize the category labels: one 0/1 column per class
lb = preprocessing.LabelBinarizer()

df = pd.read_csv('~/workspace/data/bbc-text/bbc-text-train.csv')
category = df['category_s']
lb.fit(category)
encoded_category = lb.transform(category)

# Name the binary columns with the _i suffix to match Solr's integer dynamic fields
df_encoded_category = pd.DataFrame(data=encoded_category,
                                   columns=map(lambda c: c + '_i', lb.classes_), index=None)
df_con = pd.concat([df, df_encoded_category], axis=1)

df_con.to_csv('~/workspace/data/bbc-text/bbc-text-preprocessed-train.csv', index=False)
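
The preprocessed CSV then needs to be indexed into the bbc-text-train collection (and likewise for the test split). As a minimal sketch of one possible way to do that, assuming Solr is reachable at localhost:8983 and the collections already exist, the file can be posted to Solr's CSV update handler:

%python

import os
import requests

# Post the preprocessed training CSV to the collection's CSV update handler
# (localhost:8983 and an existing bbc-text-train collection are assumptions)
path = os.path.expanduser('~/workspace/data/bbc-text/bbc-text-preprocessed-train.csv')
with open(path, 'rb') as f:
    response = requests.post('http://localhost:8983/solr/bbc-text-train/update?commit=true',
                             data=f,
                             headers={'Content-Type': 'application/csv'})
print(response.status_code, response.text)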

Training and Building Models

Having the training set indexed(1), the following streaming expression can be used to train a model and store it in Solr:

%solr

stream commit(bbc-text-model,
         update(bbc-text-model, batchSize=500,
               train(bbc-text-train,
                     features(bbc-text-train, q="*:*", featureSet="featureSet", field="text_t", outcome="tech_i", numTerms=25),
                     q="*:*",
                     name="bbc-text-tech-classification-model",
                     field="text_t",
                     outcome="tech_i",
                     maxIterations=150)))

The above expression extracts features from the training set, trains a model using these features, and commits the result to the bbc-text-model collection under the name bbc-text-tech-classification-model. Because the labels are binarized, an ensemble of models is created, one binary classifier per class; technically speaking, the above expression is looped over the categories, changing only the outcome field and the model name, as shown below.
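
For example, a sketch of the corresponding expression for the business class (only the outcome and name parameters differ; everything else is kept as above):

%solr

stream commit(bbc-text-model,
         update(bbc-text-model, batchSize=500,
               train(bbc-text-train,
                     features(bbc-text-train, q="*:*", featureSet="featureSet", field="text_t", outcome="business_i", numTerms=25),
                     q="*:*",
                     name="bbc-text-business-classification-model",
                     field="text_t",
                     outcome="business_i",
                     maxIterations=150)))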

Document Classification

The next step is to use the stored models to classify documents. The following streaming expression can be used for that:

%solr

stream classify(model(bbc-text-model, id="bbc-text-tech-classification-model"),
                search(bbc-text-test, q="*:*", fl="text_t, id, category_s", sort="id desc", rows=50),
                field="text_t")

The output of the above stream is a tuple per classified document with two additional fields: probability_d, which is useful for classifying the document, and score_d, which is useful for ranking. In our case, the assigned class is the one with the highest probability across the per-class models.
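
As an illustrative sketch that is not part of the original pipeline, the having stream decorator can wrap classify to keep only the documents the technology model scores above an arbitrary probability threshold of 0.5:

%solr

stream having(classify(model(bbc-text-model, id="bbc-text-tech-classification-model"),
                       search(bbc-text-test, q="*:*", fl="text_t, id, category_s", sort="id desc", rows=50),
                       field="text_t"),
              gt(probability_d, 0.5))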

Learning Curves

Learning curves can be plotted using the stored model as follows:

%solr

search q=*:*&collection=bbc-text-model&fl=name_s,trueNegative_i,truePositive_i,falseNegative_i,falsePositive_i,iteration_i&sort=iteration_i%20asc&fq=name_s:bbc-text-tech-classification-model&rows=10

The expression above returns the confusion matrix elements at each training iteration for the selected model; running it once per class (changing the fq filter) yields a curve for each classifier. Inside a Zeppelin notebook, the following line chart is obtained for the above stream:

Technology Class Learning Curve
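
Optionally, per-iteration precision and recall can be derived directly from these stored counts with the select stream decorator and the div and add math evaluators. The sketch below assumes those evaluators are available in your Solr version and that iterations with zero predicted positives have been excluded (otherwise they would need to be filtered out):

%solr

stream select(search(bbc-text-model, q="*:*",
                     fl="truePositive_i,falsePositive_i,falseNegative_i,iteration_i",
                     fq="name_s:bbc-text-tech-classification-model",
                     sort="iteration_i asc", rows=200),
              iteration_i,
              div(truePositive_i, add(truePositive_i, falsePositive_i)) as precision_d,
              div(truePositive_i, add(truePositive_i, falseNegative_i)) as recall_d)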

Error Decay

Here, we plot error versus iterations. The following code can be used to plot error curves:

%solr

search q=*:*&collection=bbc-text-model&fl=name_s,error_d,iteration_i&sort=iteration_i%20asc&rows=400&fq=iteration_i:[2%20TO%20*]

Note: the error at iteration 1 is very high compared to subsequent iterations, which makes the curves heavily skewed to the left. So, in the above expression, the error curves are drawn starting from iteration 2.

Error Decay Curve

Evaluation

There are many metrics that can be used to evaluate classifiers; the most popular ones are covered below.

Confusion Matrix

A confusion matrix is a tabular summary of a classifier's performance characteristics. Solr stores the elements of the matrix alongside the generated model. However, due to a glitch at the time of writing this post, the stored counts are always lower than the actual values. As a workaround, the classification results can be exported and the calculations done in Python; once the issue is addressed, the Solr interpreter can be used directly instead.

To export the classification results, Zeppelin's export functionality, located at the top right corner of table views, can be used. Once all results are exported, they can be moved into the Zeppelin container for processing using docker cp or a mounted volume.

Zeppelin Table Export and View Options

The Python code below loads the classification results into Pandas data frames, combines them, and calculates the maximum probability and the corresponding category for each document:

%python

import pandas as pd
from sklearn.metrics import confusion_matrix

categories = ['tech', 'business', 'entertainment', 'sport', 'politics']

# Load the tech results first and use them as the base data frame
df = {}
probabilities = ['probability_tech_d']
df['combined'] = pd.read_csv('~/classification-results/tech.csv')
df['combined'] = df['combined'].drop(['score_d'], axis=1)
df['combined'] = df['combined'].rename({'probability_d': 'probability_tech_d'}, axis='columns')

# Merge the remaining per-class results on the document id
for c in categories:
    if c == 'tech':
        continue
    df[c] = pd.read_csv('~/classification-results/' + c + '.csv')
    df[c] = df[c].drop(['text_t', 'category_s', 'score_d'], axis=1)
    df[c] = df[c].rename({'probability_d': 'probability_' + c + '_d'}, axis='columns')
    df['combined'] = df['combined'].merge(df[c], on='id')
    probabilities.append('probability_' + c + '_d')

# The predicted category is the class with the highest probability
df['combined']['probability_max_d'] = df['combined'][probabilities].max(axis=1)
df['combined']['probability_max_s'] = df['combined'][probabilities].idxmax(axis=1)
df['combined']['predicted_category_s'] = df['combined']['probability_max_s'].apply(
    lambda x: x.replace('probability_', '').replace('_d', ''))

print('Probabilities combined')

Based on the preceding data frame, the following paragraph calculates and visualizes the confusion matrix inside Zeppelin:

%python

import numpy as np
import matplotlib.pyplot as plt

cm = confusion_matrix(df['combined']['category_s'], df['combined']['predicted_category_s'], labels=categories)

fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)

# Show all ticks and label them with the category names
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=categories, yticklabels=categories,
       title='BBC Text Classification Confusion Matrix',
       ylabel='Category',
       xlabel='Predicted category')
ax.set_xticks(np.arange(cm.shape[1]+1)-.5, minor=True)
ax.set_yticks(np.arange(cm.shape[0]+1)-.5, minor=True)

# Rotate the tick labels and set their alignment
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Loop over data dimensions and create text annotations
fmt = 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], fmt),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black")
fig.tight_layout()

z.show(plt)

The result plotted inside the notebook should look like this nice bluish figure:

Performance

The most common metrics for evaluating classification models are accuracy, precision, recall, and F1-score, defined as follows:
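
With TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives respectively:

\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP},
\]
\[
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]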

The code snippet below computes these metrics using scikit-learn's classification_report function:

%python

from sklearn.metrics import classification_report


print(classification_report(df['combined']['category_s'], df['combined']['predicted_category_s']))

Here’s an instance of the above metrics for the classification problem at hand:


               precision    recall  f1-score   support

business            0.94      0.91      0.93        55
entertainment       0.97      0.94      0.95        33
politics            0.87      0.95      0.91        41
sport               0.96      0.98      0.97        44
tech                0.96      0.92      0.94        53

accuracy                                0.94       226
macro avg           0.94      0.94      0.94       226
weighted avg        0.94      0.94      0.94       226

1. Solr's dynamic field naming convention is assumed.