Name classification with ChatGPT

How does it compare to machine learning language models?

Author: Pedro Tabacof

Published: March 27, 2023

I explore the problem of name classification with ChatGPT and three machine learning models of increasing complexity: from logistic regression to FastAI LSTM to Hugging Face transformer. To see all the code and reproduce the results, check out the notebook.

Name classification

Can you classify a name as belonging to a person or a company? Some are easy: “Google” is clearly a company and “Pedro Tabacof” is clearly a person. Others are trickier, like “John Deere”. With a labelled dataset, we can train a machine learning model to classify names into these entity types. This is a simplification of the more general task called Named Entity Recognition, and it can also be seen as a simple version of document classification, where the document is just a name. Due to its simplicity and its relation to typical NLP problems, name classification is a good candidate for experimenting with different NLP technologies.

When I heard a friend was working on a name classification problem as part of a hiring process, I went straight to ChatGPT to look for answers. I soon realised that ChatGPT itself can do a great job of classifying names with just a couple of examples (one-shot learning):

Now, if I actually productionize that prompt using ChatGPT’s API, how would it compare to more traditional alternatives? In NLP, traditional might mean a model from just 5 years ago!

In this post, I explore four ways to classify names into person or company:

  1. Baseline using word counts and logistic regression: typical baseline for text classification
  2. FastAI LSTM fine-tuning (whole network): simple fine-tuning with few lines of code
  3. Hugging Face DistilBERT fine-tuning (head only): more involved neural network training using PyTorch
  4. ChatGPT API one-shot learning: only prompt engineering and post-processing are needed

I use two public datasets available on Kaggle: IMDb Dataset for people names and 7+ Million Company Dataset for companies. Those datasets are large, with almost 20 million names! The choice of datasets was inspired by the open-source business individual classifier by Matthew Jones, which achieves 95% accuracy on this name classification task.

For simplicity, I sample 1M names for training and 100k for testing, with a 50-50 balance between companies and people. Since the classes are balanced and ChatGPT cannot produce scores or probabilities (so we cannot use ROC AUC or average precision, definitely a big limitation of ChatGPT), I use accuracy as the main metric.

Datasets

First, I download the datasets from Kaggle and do some basic preprocessing. To reproduce the results, you will need a Kaggle account and the Kaggle command-line tool installed locally, with your API key and username in a kaggle.json file located in the directory given by the KAGGLE_CONFIG_DIR environment variable.
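The download itself is just a couple of Kaggle CLI calls followed by unzipping. A minimal sketch, run from Python for convenience; the dataset slugs are my guesses based on the archive names in the output below and may need adjusting:

import subprocess

# Dataset slugs are assumptions inferred from the zip file names in the output below
for slug in (
    "peopledatalabssf/free-7-million-company-dataset",
    "ashirwadsangwan/imdb-dataset",
):
    subprocess.run(["kaggle", "datasets", "download", "-d", slug], check=True)
    subprocess.run(["unzip", "-o", slug.split("/")[-1] + ".zip"], check=True)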

Downloading free-7-million-company-dataset.zip to /notebooks
 95%|██████████████████████████████████████  | 265M/278M [00:03<00:00, 93.1MB/s]
100%|████████████████████████████████████████| 278M/278M [00:03<00:00, 80.1MB/s]
Downloading imdb-dataset.zip to /notebooks
 99%|██████████████████████████████████████▋| 1.05G/1.06G [00:06<00:00, 153MB/s]
100%|███████████████████████████████████████| 1.06G/1.06G [00:06<00:00, 163MB/s]
Archive:  free-7-million-company-dataset.zip
  inflating: companies_sorted.csv    
Archive:  imdb-dataset.zip
  inflating: data.tsv                

I do some preprocessing, following the open-source repo the datasets were taken from:

  1. Lower-case the people dataset, since the companies dataset is all lower case (otherwise I’d suggest keeping the original case, as it can be informative).
  2. Remove odd characters and unnecessary spaces.
  3. Remove empty and null rows.

companies = pd.read_csv("companies_sorted.csv", usecols=["name"])

people = (
    pd.read_csv("data.tsv", sep="\t", usecols=["primaryName"])
    # Since the companies are all lower case, we do the same here to be fair
    .assign(name=lambda df: df.primaryName.str.lower()).drop("primaryName", axis=1)
)

df = pd.concat(
    (companies.assign(label="company"), people.assign(label="person"))
).sample(frac=1.0, random_state=42)

invalid_letters_pattern = r"""[^a-z0-9\s\'\-\.\&]"""
multiple_spaces_pattern = r"""\s+"""

df["clean_name"] = (
    df.name.str.lower()
    .str.replace(invalid_letters_pattern, " ", regex=True)
    .str.replace(multiple_spaces_pattern, " ", regex=True)
    .str.strip()
)

df = df[
    ~df.clean_name.isin(["", "nan", "null"]) & ~df.clean_name.isna() & ~df.label.isna()
][["clean_name", "label"]]

df.head(10)
          clean_name                  label
10103038  jinjin wang                 person
5566324   native waterscapes, inc.    company
8387911   jeff killian                person
6783284   lisa mareck                 person
9824680   pablo sánchez               person
6051614   dvc sales                   company
6479728   orso balla                  person
4014268   two by three media          company
2093936   house of light and design   company
11914237  hamdy faried                person

From the value counts below, we can see that we have 19.5 million names, 63% being people and 37% companies.

df.label.value_counts()
person     12344506
company     7173422
Name: label, dtype: int64

I sample 550k people and 550k companies to make the dataset balanced, then split it into 1M training and 100k testing examples.

train_df = pd.concat(
    (
        df[df.label == "company"].sample(n=1_100_000 // 2),
        df[df.label == "person"].sample(n=1_100_000 // 2),
    )
)

train_df, test_df = train_test_split(train_df, test_size=100_000, random_state=42)

I save the processed datasets for easier iteration. Tip: If you have large datasets, always try to save your preprocessed datasets to disk to prevent wasted computation.

# Saving the processed dataframes locally for quicker iterations
train_df.to_csv("train_df.csv", index=False)
test_df.to_csv("test_df.csv", index=False)

# Freeing up the memory used by the dataframes
del companies, people, df, train_df, test_df
gc.collect()

Since I freed up the memory of all datasets, I need to reload them:

# Just run from here if the datasets already exist locally
train_df = pd.read_csv("train_df.csv")
test_df = pd.read_csv("test_df.csv")

train_df.shape, test_df.shape
((1000000, 2), (100000, 2))

Now, I have a single training set with 500k people and 500k companies, and a single test set with 50k of each.

Exploratory data analysis

Before getting to the fun part, let’s first understand the data. I have two hypotheses to explore:

  1. Do we see a different distribution of words per class? I’d expect some words like “ltd” to be present only in company names and words like “john” to be over-represented in people’s names.
  2. Does sentence length vary by class? I expect a wider range for companies than for people, as company names run from just two characters, like “EY”, to mouthfuls like “National Railroad Passenger Corporation, Amtrak”. Alternatively, I could look at the number of words per sentence, since most Western names are around three words.

Anyway, beware the Falsehoods Programmers Believe About Names.

words_df = (
    train_df.assign(word=train_df.clean_name.str.split(" +"))
    .explode("word")
    .groupby(["word", "label"])
    .agg(count=("clean_name", "count"))
    .reset_index()
)

total_words = words_df["count"].sum()

words_df = words_df.assign(freq=words_df["count"]/total_words)

person_words = (
    words_df[words_df.label == "person"].sort_values("freq", ascending=False).head(25)
)
company_words = (
    words_df[words_df.label == "company"].sort_values("freq", ascending=False).head(25)
)

First, let’s take a look at the word counts by label:
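The plot itself lives in the notebook; here is a minimal matplotlib sketch of how one might draw it from the person_words and company_words computed above:

import matplotlib.pyplot as plt

# Side-by-side bar charts of the 25 most frequent words per class (sketch of the original plot)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, top_words, title in zip(axes, (person_words, company_words), ("person", "company")):
    ax.barh(top_words.word, top_words.freq)
    ax.invert_yaxis()  # most frequent word at the top
    ax.set_title(f"Top 25 words: {title}")
    ax.set_xlabel("frequency")
plt.tight_layout()
plt.show()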

We can see the first hypothesis was right: some words are quite predictive of a person or company name, and there is no intersection between the top 25 words for people and for companies. This implies that a simple but effective baseline would be a model built on top of word counts, which is what I do next. However, there is a long tail of possible names, so we have to go beyond the most common words. Another way to see how the distributions differ is by sentence length:
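Again, the actual plot is in the notebook; a quick sketch of the length comparison could look like this:

import matplotlib.pyplot as plt

# Name-length distribution per class (sketch of the original plot)
fig, ax = plt.subplots(figsize=(10, 5))
for label in ("person", "company"):
    lengths = train_df.loc[train_df.label == label, "clean_name"].str.len()
    ax.hist(lengths, bins=range(0, 80), alpha=0.5, density=True, label=label)
ax.set_xlabel("name length (characters)")
ax.set_ylabel("density")
ax.legend()
plt.show()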

Company names tend to be longer on average and have a higher variance, but interestingly both distributions peak at 13 characters. I could use sentence length as a feature, but let’s stick to word counts for now.

Baseline: Word counts + Logistic regression

Let’s start with a simple and traditional NLP baseline: word frequency and logistic regression. Alternatively, we could use Naive Bayes, but I prefer logistic regression for its greater generality and easier interpretation as a linear model.

Typically, we use TF-IDF instead of raw word counts for document classification. Since names are quite short and repeated words (e.g. “john”) are predictive, I didn’t expect it to be useful here. Indeed, a quick test showed no accuracy improvement from TF-IDF.

Another variation is to use n-grams over words or characters; this is left as a suggestion for the reader (a sketch of both variations follows).
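As a hedged sketch (not run for this post), both variations amount to swapping the vectorizer fed into the same pipeline:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# TF-IDF weighting instead of raw counts (the quick test mentioned above)
tfidf_transformer = TfidfVectorizer(analyzer="word", max_features=10000)

# Character n-grams (3- to 5-grams), often robust for short, noisy strings
char_ngram_transformer = CountVectorizer(
    analyzer="char_wb", ngram_range=(3, 5), max_features=50000
)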

text_transformer = CountVectorizer(analyzer="word", max_features=10000)
X_train = text_transformer.fit_transform(train_df["clean_name"])
X_test = text_transformer.transform(test_df["clean_name"])

logreg = LogisticRegression(C=0.1, max_iter=1000).fit(
    X_train, train_df.label == "person"
)
preds = logreg.predict(X_test)

baseline_accuracy = accuracy_score(test_df.label == "person", preds)
print(f"Baseline accuracy is {round(100*baseline_accuracy, 2)}%")
Baseline accuracy is 89.49%

89.5% accuracy is not bad for a linear model! Remember, since the dataset is balanced, a no-information baseline would be 50% accurate. Whether this is good or bad in an absolute sense depends on the actual application of the model. It also depends on the distribution of names the model would see in production: the datasets I used are quite general, containing all kinds of people and company names, while in a real application the names could be more constrained (e.g. only coming from a particular country).

Now, let’s see what mistakes the model makes (error analysis). It is always interesting to look at the examples where the model makes its worst mistakes. With tabular data it can be hard to interpret what is going on, but with perceptual data a human can understand (text, images, sound), looking at the worst mistakes gives invaluable insights into the model.

test_df["proba_person"] = logreg.predict_proba(X_test)[:, 1]
test_df["abs_error"] = np.where(
    test_df.label == "person", 1 - test_df.proba_person, test_df.proba_person
)

test_df.sort_values("abs_error", ascending=False)[
    ["clean_name", "label", "proba_person"]
].head(10)
       clean_name                                      label    proba_person
60581  co co mangina                                   person   0.000206
49398  buster benton and the sons of blues             person   0.000984
6192   best horizon consulting                         person   0.001613
83883  les enfants du centre de loisirs de chevreuse   person   0.002633
84646  manuel antonio nieto castro                     company  0.997350
32669  chris joseph                                    company  0.996298
8545   hub kapp and the wheels                         person   0.004568
77512  michael simon p.a.                              company  0.994109
71392  dylan ryan teleservices                         company  0.993017
64777  netherlands national field hockey team          person   0.007220

We can see that the mistakes are mostly understandable: There are many companies named just like people. How could the model know Chris Joseph is a company and not a person? The only way would be with information not available in the data I provided for its learning. We also see mislabelings in the people dataset: “netherlands national field hockey team” and “best horizon consulting” do not sound like people names!

This implies a high-leverage activity here would be cleaning the people dataset. If you want to make the data cleaning process sound sexier, just call it data-centric AI (just kidding: data-centric AI is actually a good framework to use for real-life machine learning applications where, in almost all cases, data trumps modelling).
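As a rough illustration of what such cleaning could look like (a heuristic sketch I did not run for this post; the keyword list is an illustrative assumption), one could flag “person” rows whose names contain obvious company words:

# Heuristic: flag rows labelled "person" that contain obvious company words
company_keywords = r"\b(inc|ltd|llc|corp|company|consulting|group|team|church|school)\b"

suspicious_people = train_df[
    (train_df.label == "person")
    & train_df.clean_name.str.contains(company_keywords, regex=True)
]
print(f"{len(suspicious_people)} suspicious 'person' rows to review or drop")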

FastAI LSTM fine-tuning

For the first of the more complex machine learning models, let’s start with FastAI due to its simple interface. Following the suggestion of this article, I use an AWD_LSTM model that was pre-trained as a language model (predicting the next word) on Wikipedia, and fine-tune it on our classification problem. FastAI’s fine_tune works in the following way: in the first epoch, it only trains the head (the newly inserted layers on top of the pre-trained language model); for all subsequent epochs, it trains the whole model. FastAI uses many tricks to make the training more effective, all wrapped in a single function call. While convenient, this makes it harder to understand what is going on behind the scenes and to customize anything.

fastai_df = pd.concat((train_df.assign(valid=False), test_df.assign(valid=True)))
dls = TextDataLoaders.from_df(
    fastai_df, text_col="clean_name", label_col="label", valid_col="valid"
)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(5, 1e-2)
epoch train_loss valid_loss accuracy time
0 0.513145 0.397019 0.802810 03:37
epoch train_loss valid_loss accuracy time
0 0.167889 0.137422 0.952030 07:53
1 0.157485 0.145000 0.956200 07:50
2 0.122625 0.139295 0.963160 07:53
3 0.112604 0.112886 0.968730 07:53
4 0.111916 0.111421 0.970460 07:53

We end up with 97.1% accuracy, almost 8 percentage points higher than the baseline! Not bad for a few lines of code and an hour of GPU time. Can we do better? Let’s try a 🤗 transformer.

Hugging Face DistilBERT classification head training

Hugging Face offers hundreds of deep learning models for inference and fine-tuning. I chose DistilBERT due to time and GPU memory constraints. By default, the Hugging Face Trainer fine-tunes all the weights of the model, but here I only want to train the classification head, which is a two-layer fully-connected network (aka MLP). The reason is twofold: 1. we’re dealing with a simple problem, and 2. I don’t want the training to run for too long, both to keep reproduction simple and to reduce GPU costs. I worked backwards from the previous result: since FastAI took roughly one hour, I wanted the same GPU time budget here.

To train only the classification head, I had to use the PyTorch interface, which allows for more flexibility. First, I download DistilBERT’s tokenizer and apply it to our dataset, then download the model itself, mark all base layers as requiring no gradient (i.e. not trainable), and finally train the classification head.

batch_size = 32
num_epochs = 3
learning_rate = 3e-5
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

First, I apply the DistilBERT tokenizer to our datasets:

tokenized_train_df = tokenizer(
    text=train_df["clean_name"].tolist(), padding=True, truncation=True
)
tokenized_test_df = tokenizer(
    text=test_df["clean_name"].tolist(), padding=True, truncation=True
)

Now, I create a PyTorch dataset that is able to handle the input format produced by the tokenizer (tokens + attention mask) alongside the labels:
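The NamesDataset class itself is defined in the notebook; a minimal sketch of what such a Dataset might look like (an assumption on my part, matching the usage below):

import torch
from torch.utils.data import Dataset

class NamesDataset(Dataset):
    """Wraps the tokenizer output (input_ids + attention_mask) together with the labels."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item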

train_dataset = NamesDataset(
    tokenized_train_df, (train_df.label == "person").astype(int)
)
test_dataset = NamesDataset(tokenized_test_df, (test_df.label == "person").astype(int))
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

I set all the base model parameters as non-trainable (so that only the classification head is trained):

for param in model.distilbert.parameters():
    param.requires_grad = False

Finally, I actually train the model (see full training and evaluation code in the notebook):
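The full loop is in the notebook; below is a condensed sketch of the gist, assuming a standard PyTorch loop with AdamW and accuracy computed from the logits:

from torch.optim import AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Only the classification head still requires gradients after the freezing above
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=learning_rate)

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)  # the loss is computed internally from the "labels" key
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    model.eval()
    correct = 0
    with torch.no_grad():
        for batch in test_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits
            correct += (logits.argmax(dim=-1) == batch["labels"]).sum().item()
    print(f"epoch {epoch}: test accuracy is {100 * correct / len(test_dataset):.3f}%")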

epoch 0: test accuracy is 96.527%
epoch 1: test accuracy is 96.776%
epoch 2: test accuracy is 96.854%

We got 96.9% accuracy, essentially the same as the FastAI LSTM model, which implies the extra complexity here was for nought. Of course, this is a simple problem: on a harder one, I’m sure a stronger pre-trained language model would have an edge over an LSTM trained on Wikipedia. Also, by not fine-tuning the whole network, we miss out on the full power of the transformer. Still, this suggests you shouldn’t write off FastAI without trying it, which, as shown above, is quite simple to do.

Let’s see which mistakes this model is making:

test_df["proba_person"] = test_preds
test_df["abs_error"] = np.where(
    test_df.label == "person", 1 - test_df.proba_person, test_df.proba_person
)
test_df.sort_values("abs_error", ascending=False)[
    ["clean_name", "label", "proba_person"]
].head(10)
       clean_name                                           label    proba_person
6192   best horizon consulting                              person   0.000008
47006  rolf schneebiegl & seine original schwarzwald-musi   person   0.000326
58512  development                                          person   0.000363
9404   xin yuan yao                                         company  0.999556
59585  cheng hsong                                          company  0.999490
46757  compagnie lyonnaise du cin ma                        person   0.000550
38224  pawel siwczak                                        company  0.999389
25983  sarah hussain                                        company  0.999311
23870  manjeet singh                                        company  0.999295
73909  glassworks                                           person   0.000776

Again, we see clear mislabelling among the people and some genuinely tough cases among the companies. Given the accuracy and the worst mistakes, we may be at the limit of what can be done with this dataset without cleaning it. Now, the final question: can I get the same level of accuracy without any supervised training at all?

ChatGPT API one-shot learning

I will use OpenAI’s API to ask ChatGPT to do the name classification for us. First, I need to define the prompt carefully, an activity now called prompt engineering. There are some rules of thumb for prompt engineering, for example, always give concrete examples before asking ChatGPT to generalise to new ones.

The ChatGPT API has three prompt types:

  • System: Helps set the tone of the conversation and gives overall directions
  • User: Represents yourself, use it to state your task or need
  • Assistant: Represents ChatGPT, use it to give examples of valid or reasonable responses

You can mix and match all prompt types, but I suggest starting with the system one, having at least one round of task-response examples, then restating the task that will actually be completed by ChatGPT.

Here, I ask ChatGPT to classify 10 names at a time into person or company. If I ask for more, say 100 names, there is a higher chance of failure (e.g. it sees a weird string and refuses to handle the whole batch). If a batch fails to parse, I fall back to querying each name individually. If ChatGPT still fails to provide a clear answer for an individual name, I default to “company”, since this class contains more problematic strings.

Finally, how can I extract the labels from ChatGPT’s response? Its answers may not match the input names exactly, for example, it may fix a misspelling or use uppercase instead of lowercase (system prompt notwithstanding). In general, it answers in the same order as the input, but can I rely on that completely for all 100k examples? To be safe, I match the names I query to the names in ChatGPT’s responses by Levenshtein distance.
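The lev module used in the code below is presumably the python-Levenshtein package (an assumption on my part; any edit-distance implementation would do). A tiny example of the matching idea:

import Levenshtein as lev

# Edit distance between a queried name and ChatGPT's (possibly "corrected") echo of it
print(lev.distance("pedro tabacof", "pedro tabacoff"))  # 1, so this response row matches the query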

To reproduce the code below, you need an OpenAI account and OPENAI_API_KEY set in your environment.

system_prompt = """
You are a named entity recognition expert.
You only answer in lowercase.
You only classify names as "company" or "person".
"""

task_prompt = "Classify the following names into company or person:"

examples_prompt = """google: company
john smith: person
openai: company
pedro tabacof: person"""

base_prompt = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": task_prompt},
    {"role": "assistant", "content": examples_prompt},
    {"role": "user", "content": task_prompt},
]

all_preds = []


def get_chatgpt_preds(batch_df):
    """ Gets predictions for a whole batch of names using ChatGPT's API"""
    prompt = base_prompt.copy()
    prompt += [{"role": "user", "content": "\n".join(batch_df.clean_name)}]

    openai.api_key = os.getenv("OPENAI_API_KEY")

    try:
        # Max tokens of 2000 is enough in practice for 10 names plus the prompt
        # Temperature is set to 0 to reduce ChatGPT's "creativity"
        # Model `gpt-3.5-turbo` is the latest ChatGPT model, which is 10x cheaper than GPT3
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo", messages=prompt, max_tokens=2000, temperature=0
        )

        # Since we gave examples as "name: class", ChatGPT almost always follows this pattern in its answers
        text_result = response["choices"][0]["message"]["content"]
        clean_text = [
            line.lower().split(":") for line in text_result.split("\n") if ":" in line
        ]

        # Fallback query: if I cannot find enough names on the response, I ask for each name separately
        # Without it, we'd have parsing failures once every 10 or 20 batches
        if len(clean_text) < len(batch_df):
            clean_text = []
            for _, row in batch_df.iterrows():
                prompt = base_prompt.copy()
                prompt += [{"role": "user", "content": row.clean_name}]

                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo", messages=prompt, max_tokens=2000, temperature=0
                )

                row_response = response["choices"][0]["message"]["content"]
                if ":" in row_response:
                    clean_text.append(
                        [row_response.split(":")[0], row_response.split(":")[-1]]
                    )
                else:
                    clean_text.append([row.clean_name, "company"])  # defaults to company

        # To ensure I'm matching the query and the corresponding answers correctly,
        # I find the closest sentences in the Levenshtein distance sense
        batch_df = batch_df.copy()
        batch_df = batch_df.merge(pd.DataFrame({"resp": clean_text}), how="cross")
        batch_df["resp_name"] = batch_df.resp.str[0].str.strip()
        batch_df["resp_pred"] = batch_df.resp.str[-1].str.strip()
        batch_df["dist"] = batch_df.apply(
            lambda row: lev.distance(row.clean_name, row.resp_name), axis=1
        )
        batch_df["rank"] = batch_df.groupby("clean_name")["dist"].rank(
            method="first", ascending=True
        )
        batch_df = batch_df.query("rank==1.0")[["clean_name", "label", "resp_pred"]]
        
    # Catches all errors
    # Errors only arise due to failures from OpenAI API and should be quite rare
    # Ideally, we should keep retrying with exponential backoff
    except Exception as e:
        print("Exception:", str(e))
        batch_df = batch_df.copy()
        batch_df["resp_pred"] = "company" # defaults to company

    return batch_df


chatgpt_num_workers = 32
chatgpt_batch_size = 10
split_size = len(test_df) // chatgpt_batch_size
test_batches = np.array_split(test_df, split_size)

chatgpt_preds = Parallel(n_jobs=chatgpt_num_workers, verbose=5)(
    delayed(get_chatgpt_preds)(batch_df) for batch_df in test_batches
)
chatgpt_preds = pd.concat(chatgpt_preds)
chatgpt_accuracy = (chatgpt_preds.resp_pred == chatgpt_preds.label).sum() / len(
    chatgpt_preds
)

print(f"ChatGPT accuracy is {round(100*chatgpt_accuracy, 2)}%")
ChatGPT accuracy is 97.52%

Incredible! With 97.5% accuracy, ChatGPT managed to outperform complex neural networks trained for this specific task. One explanation is that it used its knowledge of the world to understand some corner cases that the models could not have possibly learned from the training set alone. In some sense, this would be a form of “leakage”: perhaps ChatGPT would be weaker classifying companies founded after its cutoff date (2021).

ChatGPT is also quite cheap to run: the cost to classify the 100k examples was just under $5. It took 18 min to score all the examples, which could probably be improved by better parallelism. If you don’t use any parallelism at all, it will be much slower.

Let’s see the raw responses ChatGPT gives:

chatgpt_preds.resp_pred.value_counts().head(20)
person                                                                                                                            50955
company                                                                                                                           48913
company or person (not enough context to determine)                                                                                  13
it is not clear whether it is a company or a person.                                                                                 11
neither (not a name)                                                                                                                  7
not a name                                                                                                                            6
it is not clear whether this is a company or a person.                                                                                5
cannot be classified as either company or person                                                                                      4
company or person (not enough information to determine)                                                                               4
i'm sorry, i cannot classify this name as it does not appear to be a valid name.                                                      3
neither                                                                                                                               3
it is not clear whether it is a person or a company.                                                                                  2
n/a (not a name)                                                                                                                      2
i am sorry, i cannot classify this name as it does not provide enough information to determine if it is a company or a person.        2
this is not a name.                                                                                                                   2
this is not a valid name.                                                                                                             2
person (assuming it's a misspelling of a person's name)                                                                               2
neither person nor company (not a name)                                                                                               2
person or company (without more context it is difficult to determine)                                                                 2
place                                                                                                                                 2
Name: resp_pred, dtype: int64

For the vast majority of cases, ChatGPT answers as requested: person or company. In very rare cases, it states that it doesn’t know, that it’s unclear, or that the name could be both. What do such examples look like in practice?

chatgpt_preds[~chatgpt_preds.resp_pred.isin(["person", "company"])][["clean_name", "label", "resp_pred"]].head(10)
    clean_name                              label    resp_pred
55  alkj rskolen ringk bing                 company  neither (not a valid name)
55  81355                                   person   cannot be classified without more context
22  telepathic teddy bear                   person   neither
11  agebim                                  company  it is not clear whether it is a company or a person.
44  i quit smoking                          company  neither company nor person
33  saint peters church                     company  company (assuming it's a church organization)
88  holy trinity lutheran church akron oh   company  company (assuming it's a church organization)
55  displayname                             company  company or person (not enough context to determine)
66  ken katzen fine art                     company  company or person (not enough context to determine)
66  columbus high school                    company  company (assuming it's a school)

The names ChatGPT cannot classify are definitely tricky, like “81355” or “telepathic teddy bear”. In some cases, like for “saint peters church”, it does get it right with some extra commentary in parenthesis. All in all, I’d say ChatGPT did an amazing job and failed in a very human way.

Conclusion

I have explored four ways to classify names, from a simple logistic regression to a complex transformer. In the end, a general-purpose API, ChatGPT, outperformed them all without any task-specific supervised training.

Method                                Accuracy (%)
Baseline (logistic regression)        89.5
Benchmark (open-source classifier)    95
FastAI (LSTM fine-tuning)             97.1
Hugging Face (DistilBERT head)        96.9
ChatGPT (one-shot)                    97.5

There is a lot of hype around LLMs and ChatGPT, but I’d say they deserve the attention they’re getting. These models are turning tasks that used to require deep machine learning knowledge into software and prompt engineering problems. As a data scientist, I’m not worried about these models taking over my job, as predictive modelling is only a small part of what a data scientist does. For more thoughts on this, check out The Hierarchy of Machine Learning Needs.

Acknowledgements

I’d like to thank Erich Alves for some of the ideas explored in this post. I also thank Erich and Raphael Tamaki for reviewing the post and giving feedback.