At Doma, we use machine learning and natural language processing (NLP) to provide a fast and painless mortgage closing experience. Our rich text data helps us solve a range of interesting problems. In this post, I will share a brief overview of how we used a vector representation of our data and visualization techniques to optimize performance of one of our models.

One great thing about being a data scientist in this day and age is that we have access to: 1. large amounts of compute power, 2. vast amounts of data, 3. fundamental research, and 4. a vibrant open source community. Together, these let us take intrinsically unstructured data like images and text and map it into a metric space where we can measure differences.

Figure 1 shows a wonderful example of this. While we may not be able to directly quantify the difference between the pictures below of a lovely labrador, a cute pug, and a shy shiba inu, we can run these pictures through a large computer vision model like ResNet and quantify distances between the resulting vector representations.

*Figure 1. A. Cute dog pictures with no clear metric for differentiation. B. Vector representations of the dog pictures.*
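To make "quantify distances between vector representations" concrete, here is a minimal sketch in plain Python. The four-dimensional vectors are made-up stand-ins for illustration only; real ResNet embeddings have far more dimensions, and the numbers below are not taken from any model.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
embeddings = {
    "labrador": [0.9, 0.1, 0.4, 0.2],
    "pug": [0.2, 0.8, 0.5, 0.1],
    "shiba_inu": [0.7, 0.2, 0.6, 0.3],
}


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of named vectors."""
    names = sorted(vectors)
    return {
        (a, b): euclidean_distance(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


for pair, distance in pairwise_distances(embeddings).items():
    print(pair, round(distance, 3))
```

With these toy numbers the labrador sits closer to the shiba inu than to the pug, which is exactly the kind of statement we cannot make about the raw pictures.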

We will use the ideas shown in Figure 1 about measuring unstructured data for insights into our text-based model.

Our motivation for the model and related visualization techniques was to streamline a lengthy set of tasks related to the home buying experience. By using machine learning and NLP techniques to classify a large amount of text data, we sped up a tedious, error-prone, and sometimes ambiguous manual process and provided a better experience for our clients. One goal was to significantly reduce the amount of time our own associates spend on this work, while our ultimate goal was to provide an accurate “instant” service for all transaction parties including our lending clients and their customers.

To build our model, we leveraged DistilBERT, the smaller, faster, cheaper, lighter version of the Bidirectional Encoder Representations from Transformers (BERT) model, which is a general language model.

The goal was to build a multiclass classification model for our text data, and our approach was to fine-tune a pre-trained DistilBERT model for sequence classification using the Hugging Face transformers package. For the vector representation of our dataset, we used the output of the pre-classification layer of our fine-tuned model (see Appendix for the code sample).

This text vector representation has 768 dimensions, so we next turned to dimension reduction techniques to visualize our data.

I first started thinking about moving between dimensions when I read the book Flatland as a child. The book details the lives of beings who exist in zero, one, and two dimensions.

In the book, a sphere visiting from 3-D spaceland looking down on 2-D flatland tells a circle, “I discerned all that you speak of as solid (by which you mean “enclosed on four sides”)… even your insides and stomachs, all lying open and exposed to my view.” The sphere then proceeds to give the circle a light touch on his stomach to show that he has access to a dimension of which the circle cannot conceive. After that, whenever I had a stomach ache, I would wonder if a higher dimensional being were accessing my insides.

For now, I’ve accepted that I will not be able to appreciate my high-dimensional data in its natural state, so we will focus on low-dimensional representations. To develop intuition for projecting high-dimensional data down to two dimensions, let’s first consider projecting 3-D data into 2-D.

Cartographers have been projecting the earth’s surface into 2-D maps for centuries, and Figure 2 shows some of the challenges that result.

*Figure 2. Two different projections of the earth into 2-D. Each projection has tradeoffs between distortion and continuity. Credit to Daniel R Strebe (https://futuremaps.com/blogs/news/top-10-world-map-projections)*

The leftmost plot preserves continuity at the expense of distortions in area at the North and South poles, while the rightmost plot preserves the area of countries at the expense of the continuity of the surface.
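The distortion is easy to reproduce numerically. The sketch below uses the equirectangular (plate carrée) projection, which is an assumption chosen for its simplicity and is not necessarily either projection in Figure 2. It compares the true great-circle distance between two points 10 degrees of longitude apart with the distance measured on the flat map.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """True surface distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def equirectangular_xy(lat, lon):
    """Flat-map coordinates: x follows longitude, y follows latitude."""
    return (EARTH_RADIUS_KM * math.radians(lon),
            EARTH_RADIUS_KM * math.radians(lat))


def map_distance_km(lat1, lon1, lat2, lon2):
    """Straight-line distance between the two projected points."""
    x1, y1 = equirectangular_xy(lat1, lon1)
    x2, y2 = equirectangular_xy(lat2, lon2)
    return math.hypot(x2 - x1, y2 - y1)


def distortion_ratio(lat, lon_gap=10.0):
    """Map distance divided by true distance for two points at one latitude."""
    true_km = great_circle_km(lat, 0.0, lat, lon_gap)
    flat_km = map_distance_km(lat, 0.0, lat, lon_gap)
    return flat_km / true_km


for latitude in (0, 45, 80):
    print(latitude, round(distortion_ratio(latitude), 2))
```

On this map the same 10 degree gap is drawn at the same width everywhere, so the ratio stays near 1 at the equator and grows quickly toward the poles.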

For high-dimensional data, it becomes even more complicated to preserve small- and large-scale patterns as we translate the data into 2-D, but a common approach is to use the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm.

The t-SNE algorithm calculates the probability that any pair of data points are neighbors. The further apart the points are in Euclidean distance, the lower the probability that they are neighbors, and vice versa. The distribution of these pairwise probabilities is calculated in both the original high-dimensional space and the new low-dimensional space. The algorithm then uses gradient descent to minimize the differences between the two distributions.
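As a rough illustration of the neighbor probabilities, the sketch below computes, for one point, a Gaussian-weighted probability for every other point. It is a simplification: it uses a single fixed `sigma` (an assumption for this example), whereas t-SNE tunes the bandwidth per point and then symmetrizes the probabilities.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def neighbor_probabilities(points, i, sigma=1.0):
    """Probability that point i picks each other point as its neighbor.

    Closer points get exponentially more weight; a point is never its
    own neighbor, so its entry is 0.
    """
    weights = [
        0.0 if j == i
        else math.exp(-squared_distance(points[i], p) / (2 * sigma ** 2))
        for j, p in enumerate(points)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Toy 2-D points: the second is near the first, the third is far away.
toy_points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(neighbor_probabilities(toy_points, i=0))
```

Almost all of the probability mass for the first point lands on its nearby neighbor, which is the "further apart means less likely to be neighbors" behavior described above.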

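More specifically, as seen in Equation 1, the cost function to minimize the differences between the high- and low-dimensional distributions of the data is given by the Kullback-Leibler divergence

$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (1)$$

where $p_{ij}$ represents the probability-based distance between points $i$ and $j$ in the original high-dimensional space and $q_{ij}$ represents the corresponding quantity in the low-dimensional space.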
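To see how the Kullback-Leibler cost rewards a faithful layout, here is a small sketch. The high-dimensional probabilities `p` are invented for illustration (points 0 and 1 are meant to be close neighbors), and the low-dimensional similarities use the Student-t kernel with one degree of freedom, as in the standard t-SNE formulation. Both candidate layouts are hypothetical.

```python
import math


def student_t_similarities(points):
    """Low-dimensional pair probabilities from a heavy-tailed kernel."""
    weights = {}
    for i, a in enumerate(points):
        for j, b in enumerate(points):
            if i != j:
                d2 = sum((x - y) ** 2 for x, y in zip(a, b))
                weights[(i, j)] = 1.0 / (1.0 + d2)
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def kl_divergence(p, q):
    """Sum of p * log(p / q) over all pairs with non-zero p."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


# Made-up high-dimensional probabilities: points 0 and 1 are close neighbors.
p = {
    (0, 1): 0.40, (1, 0): 0.40,
    (0, 2): 0.05, (2, 0): 0.05,
    (1, 2): 0.05, (2, 1): 0.05,
}

# Two candidate 2-D layouts for the same three points.
faithful_layout = [(0.0, 0.0), (0.5, 0.0), (4.0, 0.0)]    # 0 and 1 stay close
misleading_layout = [(0.0, 0.0), (4.0, 0.0), (0.5, 0.0)]  # 0 and 2 end up close

cost_faithful = kl_divergence(p, student_t_similarities(faithful_layout))
cost_misleading = kl_divergence(p, student_t_similarities(misleading_layout))
print(cost_faithful, cost_misleading)
```

The layout that keeps the true neighbors together has a much lower cost, and that gap is what the optimizer follows when it moves points around.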

As we saw from the world map projections in Figure 2, it is impossible to preserve both true distance and the relative area of land masses with one projection. It’s important to keep in mind that every projection is a simplification which will cause some amount of distortion. In the same way, t-SNE plots provide intuition but can easily be misleading if poor combinations of parameters are used.

In practice, it is possible both to see structure in a plot when none exists in the data and to miss structure that does exist. For the plots shown in the next section, we experimented with perplexity, the number of iteration steps, and learning rate until we found a stable output. The code for obtaining the t-SNE vectors is shown in the Appendix.
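Of these parameters, perplexity is the least self-explanatory. In the standard t-SNE formulation it is 2 raised to the Shannon entropy of a point's neighbor distribution, which can be read loosely as the effective number of neighbors a point pays attention to. The sketch below, with made-up distances and bandwidths, shows how a wider Gaussian bandwidth raises that number.

```python
import math


def neighbor_probabilities(distances, sigma):
    """Gaussian-weighted neighbor probabilities for one point."""
    weights = [math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probabilities):
    """2 ** (Shannon entropy in bits) of a probability distribution."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


# Hypothetical distances from one point to its nine neighbors.
distances_to_neighbors = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

narrow = perplexity(neighbor_probabilities(distances_to_neighbors, sigma=0.5))
wide = perplexity(neighbor_probabilities(distances_to_neighbors, sigma=5.0))
print(narrow, wide)
```

A low perplexity makes each point care only about its closest neighbors and tends to emphasize fine local structure, while a high perplexity spreads attention more widely. That is one reason the same data can look quite different across settings.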

Based on summary metrics of precision and F1 score, the DistilBERT-based multiclass classification model was performing well. Using a variety of text preprocessing techniques as well as experimentation with optimizers, stopping criteria, and different flavors of transformer models had produced a respectable model. But could we do better?

To push model performance even higher, we turned to our data. We combined the two concepts described above, 1. mapping text data into a high-dimensional vector space and 2. projecting the high-dimensional vectors into 2-D, to produce the t-SNE plot of our data in Figure 3.

We have anonymized the data but can still see interesting features in Figure 3 on several different scales. The coloring of the points is based on ground truth labels. Each black dot belongs to one of 30 classes that has been predicted well based on precision and F1 score, and each larger colorful dot belongs to a class that has not met our minimum requirements for precision and F1 score.
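The split between well-predicted and under-performing classes can be sketched as follows. The labels, predictions, and the 0.9 thresholds are all hypothetical; this post does not state the minimum requirements we actually used.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Precision, recall, and F1 score for every class label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, predicted in zip(y_true, y_pred):
        if truth == predicted:
            tp[truth] += 1
        else:
            fp[predicted] += 1  # predicted class gains a false positive
            fn[truth] += 1      # true class gains a false negative
    metrics = {}
    for label in set(y_true) | set(y_pred):
        predicted_count = tp[label] + fp[label]
        actual_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / actual_count if actual_count else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def classes_below_threshold(metrics, min_precision=0.9, min_f1=0.9):
    """Labels that miss either (hypothetical) minimum requirement."""
    return sorted(
        label for label, m in metrics.items()
        if m["precision"] < min_precision or m["f1"] < min_f1
    )


# Toy data: a large class "a" and a small class "b" with one miss.
y_true = ["a"] * 10 + ["b"] * 2
y_pred = ["a"] * 10 + ["b", "a"]

print(classes_below_threshold(per_class_metrics(y_true, y_pred)))
```

In a plot like Figure 3, the flagged labels would be the ones drawn as larger colorful dots, and everything else would be black.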

The larger well-formed clusters in the center of the plot illustrate that our model is performing quite well for classes with more data that have clear patterns in their text representations. We can also see smaller groups of black dots that represent classes that may not be as well populated but are still predicted well.

The colorful points appear to be fewer in number and scattered more widely across the plot than the black points. Not surprisingly, when we looked at the raw data, we confirmed that classes with fewer data points and more variation are not predicted as well.

Further inspection shows some large, well-formed clusters of black points include some of the colorful points, suggesting mislabels in the data. In addition, as we look at the pink data points, we see some evidence of clustering within that class, hinting that this one class should perhaps be split into several classes.

We decided to explore these possibilities further.

Figure 4 shows zoomed-in portions of the t-SNE plot in Figure 3. These segments highlight what appears to be mislabeled data.

*Figure 4. Portions of the graph in Figure 3 that highlight possible mislabeled data.*

In the leftmost plot, we see pink points from Class 1 mixed into a separate class (represented by black points) that is predicted well. This suggests that the ground truth label of Class 1 for these points may be incorrect and that these points really belong to the class in which they are grouped in the t-SNE plot. Likewise, in the middle plot, we see potential mislabels of Class 1 and Class 5 (blue points). Finally, we see additional mislabels of Class 4 (red points) in the last plot.

These subplots motivated us to do targeted manual inspection of our data. Even though the total amount of mislabeled data was quite small, just a few mislabeled data points in classes with less data had an outsized effect on precision and F1 scores for those classes. Correcting the labels improved overall model performance.
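One simple way to turn this visual check into a candidate list for manual review is a nearest-neighbor vote: flag any point whose closest neighbors in the embedding all carry a different label. The sketch below is an illustrative heuristic under that assumption, not the tooling we built, and the points and labels are invented. Its output is a list of points to inspect, not a list of confirmed errors.

```python
import math
from collections import Counter


def suspect_labels(points, labels, k=3, min_agreement=1.0):
    """Flag points whose k nearest neighbors agree on a different label.

    Returns (index, current_label, neighbor_label) tuples. With
    min_agreement=1.0, all k neighbors must share the other label.
    """
    suspects = []
    for i, point in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(point, points[j]))[:k]
        votes = Counter(labels[j] for j in nearest)
        top_label, top_count = votes.most_common(1)[0]
        if top_label != labels[i] and top_count / k >= min_agreement:
            suspects.append((i, labels[i], top_label))
    return suspects


# Toy 2-D embedding: one "pink" point sits inside a tight "black" cluster.
toy_points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # cluster near the origin
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),              # cluster near (5, 5)
]
toy_labels = ["black", "black", "black", "pink", "pink", "pink", "pink"]

print(suspect_labels(toy_points, toy_labels))
```

Because t-SNE preserves local neighborhoods better than global distances, a vote over a handful of nearest neighbors is a more defensible signal than anything based on how far apart two clusters appear.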

We also noticed clustering within several classes, suggesting that a single class might be better represented by more than one class. In Figure 5, we see two examples of potential clustering within Class 1 depicted by the light pink points within the darker outline.

*Figure 5. Portions of the graph in Figure 3 that show candidates for either new or more inclusive class labels.*

For the left plot, we found that merging some of the pink points in Class 1 into the same class as the black points which they overlap resulted in better performance. The resulting class is indicated by the darker pink outline. For the right plot, we also found support for creating a new class for the cluster of points shown, again indicated by the outline.
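A rough way to screen a class for this kind of internal clustering is to split its points into two groups and measure how much of the within-class spread disappears. The sketch below uses a tiny two-means routine written for this example; the `split_gain` score and the toy points are assumptions, not the procedure we used. Even a single compact blob shows some gain when cut in half, so the score is a prompt for a closer look and not a decision rule.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    count = len(points)
    return tuple(sum(axis) / count for axis in zip(*points))


def scatter(points):
    """Sum of squared distances from each point to the centroid."""
    center = centroid(points)
    return sum(math.dist(p, center) ** 2 for p in points)


def two_means(points, iterations=20):
    """Split points into two groups with a minimal k-means (k=2)."""
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))  # farthest from c0
    groups = (list(points), [])
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            nearer_c0 = math.dist(p, c0) <= math.dist(p, c1)
            groups[0 if nearer_c0 else 1].append(p)
        if not groups[0] or not groups[1]:
            break
        c0, c1 = centroid(groups[0]), centroid(groups[1])
    return groups


def split_gain(points):
    """Fraction of the class's scatter removed by splitting it in two."""
    total = scatter(points)
    if total == 0:
        return 0.0
    first, second = two_means(points)
    if not first or not second:
        return 0.0
    return 1.0 - (scatter(first) + scatter(second)) / total


# Toy class whose points form two well-separated blobs.
class_points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.2),
    (4.0, 4.0), (4.1, 4.2), (4.2, 4.1),
]
print(round(split_gain(class_points), 3))
```

A gain close to 1 means nearly all of the spread within the class comes from the gap between two sub-groups, which makes that class a candidate for splitting or relabeling.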

Visualization techniques are an important part of the model iteration process. After experimenting with text preprocessing and model parameters, t-SNE plots directed us to look more critically at our training data. The greatest gains in model performance then came from initiating dialogue with other teams about how the data was created and working together to improve the quality of our data.

By using insights from our t-SNE plots, we created tools to surface mislabeled data and highlight natural clustering within classes. We also initiated procedures with other teams to bring more rigor to our data collection process. Now, as our datasets and company grow, we are well-positioned to collectively improve our data and model and continually deliver better value for our customers.

Our first step for visualization was to map our text data into a vector space. To do this, we took a pre-trained DistilBertForSequenceClassification model from huggingface that we fine-tuned on our text classification task. For simplicity, I omit the code for the fine-tuning of the model.

In the code snippet below, the get_activation function is used to register a forward hook to extract the output from the layer of interest. The get_vectors function takes as input a PyTorch DataLoader object representing the tokenized input text data and a fine-tuned DistilBertForSequenceClassification model and returns vectors corresponding to the layer of interest. In this case, we are interested in the output of the pre_classifier layer.

    import torch
    from torch import nn

    # Stores the most recent output of each hooked layer, keyed by layer name.
    activation = {}


    def get_activation(name):
        def hook(model, input, output):
            activation[name] = output.detach()
        return hook


    def get_vectors(data_loader, model, layer='pre_classifier'):
        # Capture the output of the requested layer on every forward pass.
        handle = getattr(model, layer).register_forward_hook(get_activation(layer))
        text_vectors = []
        model.eval()
        with torch.no_grad():
            for batch in data_loader:
                b_input_ids = batch['input_ids']
                b_input_mask = batch['attention_mask']
                model(b_input_ids, attention_mask=b_input_mask)
                # Apply the ReLU that follows pre_classifier in the model's forward pass.
                layer_output = nn.ReLU()(activation[layer]).cpu().numpy().tolist()
                text_vectors.extend(layer_output)
        handle.remove()  # avoid stacking hooks across repeated calls
        return text_vectors

For faster and more reliable computation of t-SNE vectors, it is recommended to first compress very high dimensional data using linear techniques such as principal components analysis (PCA). Below is a code snippet in which we retain enough components of the high-dimensional text vectors from the get_vectors function above to explain 95% of the variance in the data. This reduced our text vector dimensions from 768 down to ~30 components.

We experimented with t-SNE transformations on the ~30 dimensional vectors using a variety of parameters. We found that the default t-SNE worked well for us and used the below code snippet to obtain our vectors after both the PCA and t-SNE transformations.

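    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE


    def get_transformed_vectors(text_vectors, explained_variance=0.95):
        # Return PCA'ed vectors and t-SNE vectors
        # Fit a full PCA to find how many components exceed the variance threshold.
        bert_pca = PCA()
        bert_pca.fit(text_vectors)
        cumulative_variance = np.cumsum(bert_pca.explained_variance_ratio_)
        # np.where gives a zero-based index, so add 1 to get a component count.
        n_components = int(np.where(cumulative_variance > explained_variance)[0][0]) + 1
        bert_pca = PCA(n_components=n_components)
        bert_pca_vectors = bert_pca.fit_transform(text_vectors)
        # Project the PCA-compressed vectors into 2-D with default t-SNE settings.
        bert_tsne = TSNE()
        bert_tsne_vectors = bert_tsne.fit_transform(bert_pca_vectors)
        return bert_pca_vectors, bert_tsne_vectors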
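The component-count step is easy to get wrong by one, so here it is in isolation as a plain-Python sketch. The variance ratios are invented for illustration. The position of the first cumulative value above the target is a zero-based index, and the number of components to keep is that index plus one.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose variance exceeds target."""
    running_totals = accumulate(explained_variance_ratio)
    for index, cumulative in enumerate(running_totals):
        if cumulative > target:
            return index + 1  # index is zero-based; a count is one more
    return len(explained_variance_ratio)


# Hypothetical per-component variance ratios, largest first.
toy_ratios = [0.60, 0.25, 0.08, 0.04, 0.03]
print(components_for_variance(toy_ratios))
```

As far as I know, scikit-learn's PCA can also take a fraction such as `n_components=0.95` and pick the count itself. Treat that as a pointer to check against the documentation for your version.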
