I explore the problem of name classification with ChatGPT and three machine learning models of increasing complexity: from logistic regression to a FastAI LSTM to a Hugging Face transformer. To see all the code and reproduce the results, check out the notebook.
Name classification
Can you classify a name as belonging to a person or a company? Some are easy: “Google” is a company and “Pedro Tabacof” is a person. Some are trickier, like “John Deere”. With a labelled dataset, we can train a machine learning model to classify names into these entities. This is a simplification of the more general task of Named Entity Recognition, and it can also be seen as a simple version of document classification where the document is just a name. Due to its simplicity and its relation to typical NLP problems, name classification is a good candidate for experimenting with different NLP technologies.
When I heard a friend was working on a name classification problem as part of a hiring process, I went straight to ChatGPT to look for answers. I soon realised that ChatGPT can do a great job itself classifying names into entities with just a couple of examples (one-shot learning):
Now, if I actually productionize that prompt using ChatGPT’s API, how would it compare to more traditional alternatives? In NLP, traditional might mean a model from just 5 years ago!
In this post, I explore four ways to classify names into person or company:
- Baseline using word counts and logistic regression: typical baseline for text classification
- FastAI LSTM fine-tuning (whole network): simple fine-tuning with few lines of code
- Hugging Face DistilBERT fine-tuning (head only): more involved neural network training using PyTorch
- ChatGPT API one-shot learning: only prompt engineering and post-processing are needed
I use two public datasets available on Kaggle: IMDb Dataset for people names and 7+ Million Company Dataset for companies. Those datasets are large, with almost 20 million names! The choice of datasets was inspired by the open-source business individual classifier by Matthew Jones, which achieves 95% accuracy on this name classification task.
For simplicity, I sample 1M names for training and 100k for testing, with a 50-50 balance between companies and people. Since the classes are balanced and ChatGPT cannot produce scores or probabilities (ruling out ROC AUC or average precision, a real limitation of ChatGPT), I use accuracy as the main metric.
Datasets
First, I download the datasets from Kaggle and do some basic preprocessing. To reproduce the results, you will need a Kaggle account and the Kaggle command-line tool installed locally. You also need to add your API key and username to a kaggle.json file, placed in the directory defined by the KAGGLE_CONFIG_DIR environment variable.
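If you prefer to script the download step rather than call the CLI by hand, here is a minimal sketch using the Kaggle Python API. The dataset slugs are my assumption of the public Kaggle ones and may need adjusting:

# A sketch of the download step, assuming the `kaggle` package is installed and
# kaggle.json is in the directory pointed to by KAGGLE_CONFIG_DIR.
# The dataset slugs below are assumptions based on the public Kaggle pages.
import kaggle

kaggle.api.authenticate()
kaggle.api.dataset_download_files("peopledatalabssf/free-7-million-company-dataset", path=".", unzip=True)
kaggle.api.dataset_download_files("ashirwadsangwan/imdb-dataset", path=".", unzip=True)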
Downloading imdb-dataset.zip to /notebooks
99%|██████████████████████████████████████▋| 1.05G/1.06G [00:06<00:00, 153MB/s]
100%|███████████████████████████████████████| 1.06G/1.06G [00:06<00:00, 163MB/s]
Archive: free-7-million-company-dataset.zip
inflating: companies_sorted.csv
Archive: imdb-dataset.zip
inflating: data.tsv
I do some preprocessing, inspired by the open-source repo that also inspired the choice of datasets:
1. Lower-case the people dataset, since the companies dataset is all lower case (otherwise I'd suggest keeping the original case, as it can be informative).
2. Remove odd characters and unnecessary spaces.
3. Remove empty and null rows.
= pd.read_csv("companies_sorted.csv", usecols=["name"])
companies
= (
people "data.tsv", sep="\t", usecols=["primaryName"])
pd.read_csv(# Since the companies are all lower case, we do the same here to be fair
=lambda df: df.primaryName.str.lower()).drop("primaryName", axis=1)
.assign(name
)
= pd.concat(
df ="company"), people.assign(label="person"))
(companies.assign(label=1.0, random_state=42)
).sample(frac
= r"""[^a-z0-9\s\'\-\.\&]"""
invalid_letters_pattern = r"""\s+"""
multiple_spaces_pattern
"clean_name"] = (
df[str.lower()
df.name.str.replace(invalid_letters_pattern, " ", regex=True)
.str.replace(multiple_spaces_pattern, " ", regex=True)
.str.strip()
.
)
= df[
df ~df.clean_name.isin(["", "nan", "null"]) & ~df.clean_name.isna() & ~df.label.isna()
"clean_name", "label"]]
][[
10) df.head(
|  | clean_name | label |
|---|---|---|
| 10103038 | jinjin wang | person |
| 5566324 | native waterscapes, inc. | company |
| 8387911 | jeff killian | person |
| 6783284 | lisa mareck | person |
| 9824680 | pablo sánchez | person |
| 6051614 | dvc sales | company |
| 6479728 | orso balla | person |
| 4014268 | two by three media | company |
| 2093936 | house of light and design | company |
| 11914237 | hamdy faried | person |
From the value counts below, we can see that we have 19.5 million names, 63% being people and 37% companies.
df.label.value_counts()
person 12344506
company 7173422
Name: label, dtype: int64
I sample 550k people and 550k companies to make the dataset balanced, and then split it into 1M training and 100k testing examples.
from sklearn.model_selection import train_test_split

train_df = pd.concat(
    (
        df[df.label == "company"].sample(n=1_100_000 // 2),
        df[df.label == "person"].sample(n=1_100_000 // 2),
    )
)

train_df, test_df = train_test_split(train_df, test_size=100_000, random_state=42)
I save the processed datasets for easier iteration. Tip: If you have large datasets, always try to save your preprocessed datasets to disk to prevent wasted computation.
import gc

# Saving the processed dataframes locally for quicker iterations
train_df.to_csv("train_df.csv", index=False)
test_df.to_csv("test_df.csv", index=False)

# Freeing up the memory used by the dataframes
del companies, people, df, train_df, test_df
gc.collect()
Since I freed up the memory of all datasets, I need to reload them:
# Just run from here if the datasets already exist locally
train_df = pd.read_csv("train_df.csv")
test_df = pd.read_csv("test_df.csv")

train_df.shape, test_df.shape
((1000000, 2), (100000, 2))
Now I have a single training set with 500k people and 500k companies, and a single test set with 50k people and 50k companies.
Exploratory data analysis
Before getting to the fun part, let's first understand the data. I have two hypotheses to explore:
- Do we see a different distribution of words per class? I’d expect some words like “ltd” to be present only in companies and words like “john” to be over-represented in names.
- Does sentence length vary by class? I expect a wider range for companies than for people, as company names go from just two characters, like “EY”, to mouthfuls like “National Railroad Passenger Corporation, Amtrak”. Alternatively, I could look at the number of words per name, since most Western names are around three words.
Anyway, beware the Falsehoods Programmers Believe About Names.
words_df = (
    train_df.assign(word=train_df.clean_name.str.split(" +"))
    .explode("word")
    .groupby(["word", "label"])
    .agg(count=("clean_name", "count"))
    .reset_index()
)

total_words = words_df["count"].sum()

words_df = words_df.assign(freq=words_df["count"] / total_words)

person_words = (
    words_df[words_df.label == "person"].sort_values("freq", ascending=False).head(25)
)
company_words = (
    words_df[words_df.label == "company"].sort_values("freq", ascending=False).head(25)
)
First, let’s take a look at the word counts by label:
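The bar chart itself is rendered in the notebook; a minimal matplotlib sketch of the cell that produces it (the exact styling is an assumption):

# Plot the 25 most frequent words per class, using the person_words and
# company_words dataframes computed above.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
person_words.plot.barh(x="word", y="freq", ax=axes[0], title="Top 25 person words", legend=False)
company_words.plot.barh(x="word", y="freq", ax=axes[1], title="Top 25 company words", legend=False)
for ax in axes:
    ax.invert_yaxis()  # most frequent word at the top
plt.tight_layout()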
We can see that our first hypothesis was right: some words are quite predictive of a name being a person or a company, and there is no intersection between the top 25 words for people and for companies. This suggests a simple but effective baseline: a model built on top of word counts, which is what I do next. However, there is a long tail of possible names, so we have to go beyond the most common words. Another way to see how the distributions differ is by sentence length:
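As before, the figure lives in the notebook; a sketch of the comparison, assuming overlaid histograms of name length (in characters) per class:

# Distribution of name length (in characters) for people vs companies.
length_df = train_df.assign(name_length=train_df.clean_name.str.len())
ax = length_df[length_df.label == "person"].name_length.plot.hist(
    bins=60, alpha=0.5, density=True, label="person"
)
length_df[length_df.label == "company"].name_length.plot.hist(
    bins=60, alpha=0.5, density=True, label="company", ax=ax
)
ax.set_xlabel("name length (characters)")
ax.legend()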
Company names tend to be longer on average and have a higher variance, but interestingly they both peak at 13 characters. I could use sentence length as a feature, but let’s stick to word counts for now.
Baseline: Word counts + Logistic regression
Let’s start with a simple and traditional NLP baseline: word frequency and logistic regression. Alternatively, we could use Naive Bayes, but I prefer logistic regression for its greater generality and easier interpretation as a linear model.
Typically, we use TF-IDF instead of raw word counts for document classification. Since names are quite short and repeated words (e.g. “john”) are predictive, I don't believe it is useful here. Indeed, a quick test showed no accuracy improvement from using TF-IDF.
Another variation is to use n-grams over words or characters; I leave that as a suggestion to the reader (a sketch of both variations follows).
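For reference, both variations are drop-in replacements for the vectorizer used below (a sketch only; neither is used for the reported results):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# TF-IDF weighting instead of raw counts (no accuracy gain in my quick test)
tfidf_transformer = TfidfVectorizer(analyzer="word", max_features=10000)

# Character n-grams can capture sub-word patterns such as common company-name endings
char_ngram_transformer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=10000)

The actual baseline below sticks with plain word counts.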
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

text_transformer = CountVectorizer(analyzer="word", max_features=10000)
X_train = text_transformer.fit_transform(train_df["clean_name"])
X_test = text_transformer.transform(test_df["clean_name"])

logreg = LogisticRegression(C=0.1, max_iter=1000).fit(
    X_train, train_df.label == "person"
)
preds = logreg.predict(X_test)

baseline_accuracy = accuracy_score(test_df.label == "person", preds)
print(f"Baseline accuracy is {round(100*baseline_accuracy, 2)}%")
Baseline accuracy is 89.49%
89.5% accuracy is not bad for a linear model! Remember, since the datasets are balanced, a baseline without any information would get 50% accuracy. Whether this is good or bad in an absolute sense depends on the actual application of the model. It also depends on the distribution of words the model would see in production: the datasets I used are quite general, containing all kinds of people and company names, whereas in a real application the names could be more constrained (e.g. coming only from a particular country).
Now, let's see what mistakes the model makes (error analysis). It's always interesting to look at the examples where the model makes its worst mistakes. With tabular data it can be hard to interpret what is going on, but for perceptual data a human can understand (text, images, sound), this kind of inspection leads to invaluable insights into the model.
"proba_person"] = logreg.predict_proba(X_test)[:, 1]
test_df["abs_error"] = np.where(
test_df[== "person", 1 - test_df.proba_person, test_df.proba_person
test_df.label
)
"abs_error", ascending=False)[
test_df.sort_values("clean_name", "label", "proba_person"]
[10) ].head(
|  | clean_name | label | proba_person |
|---|---|---|---|
| 60581 | co co mangina | person | 0.000206 |
| 49398 | buster benton and the sons of blues | person | 0.000984 |
| 6192 | best horizon consulting | person | 0.001613 |
| 83883 | les enfants du centre de loisirs de chevreuse | person | 0.002633 |
| 84646 | manuel antonio nieto castro | company | 0.997350 |
| 32669 | chris joseph | company | 0.996298 |
| 8545 | hub kapp and the wheels | person | 0.004568 |
| 77512 | michael simon p.a. | company | 0.994109 |
| 71392 | dylan ryan teleservices | company | 0.993017 |
| 64777 | netherlands national field hockey team | person | 0.007220 |
We can see that the mistakes are mostly understandable: There are many companies named just like people. How could the model know Chris Joseph is a company and not a person? The only way would be with information not available in the data I provided for its learning. We also see mislabelings in the people dataset: “netherlands national field hockey team” and “best horizon consulting” do not sound like people names!
This implies a high-leverage activity here would be cleaning the people dataset. If you want to make the data cleaning process sound sexier, just call it data-centric AI (just kidding: data-centric AI is actually a good framework to use for real-life machine learning applications where, in almost all cases, data trumps modelling).
FastAI LSTM fine-tuning
For the first of the more complex models, let's start with FastAI due to its simple interface. Following the suggestion of this article, I use an AWD_LSTM model that was pre-trained as a language model (predicting the next word) on Wikipedia, and fine-tune it on our classification problem. FastAI's fine_tune works in the following way: in the first epoch, it trains only the head (the newly added layers on top of the pre-trained language model); in all subsequent epochs, it trains the whole model. FastAI uses many tricks to make training more effective, all wrapped in a single function call. While convenient, this makes it harder to understand what is going on behind the scenes and to customize the training.
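In rough pseudocode, the fine_tune call used below is equivalent to the following (a simplified sketch; the real implementation also applies discriminative learning rates and other one-cycle scheduling details):

# Roughly what learn.fine_tune(5, 1e-2) does behind the scenes.
learn.freeze()                  # epoch 0: train only the new classification head
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()                # remaining epochs: train the whole network
learn.fit_one_cycle(5, slice(1e-2 / 100, 1e-2 / 2))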
from fastai.text.all import AWD_LSTM, TextDataLoaders, accuracy, text_classifier_learner

fastai_df = pd.concat((train_df.assign(valid=False), test_df.assign(valid=True)))
dls = TextDataLoaders.from_df(
    fastai_df, text_col="clean_name", label_col="label", valid_col="valid"
)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

learn.fine_tune(5, 1e-2)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.513145 | 0.397019 | 0.802810 | 03:37 |

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.167889 | 0.137422 | 0.952030 | 07:53 |
| 1 | 0.157485 | 0.145000 | 0.956200 | 07:50 |
| 2 | 0.122625 | 0.139295 | 0.963160 | 07:53 |
| 3 | 0.112604 | 0.112886 | 0.968730 | 07:53 |
| 4 | 0.111916 | 0.111421 | 0.970460 | 07:53 |
We end up with 97.1% accuracy, almost 8 percentage points higher than the baseline! Not bad for a few lines of code and about an hour of GPU time. Can we do better? Let's try a 🤗 transformer.
Hugging Face DistilBERT classification head training
Hugging Face offers hundreds of possible deep learning models for inference and fine-tuning. I chose DistilBERT due to time and GPU memory constraints. By default, the Hugging Face trainer fine-tunes all the weights of the model, but here I just want to train the classification head, which is a two-layer fully-connected neural network (aka MLP). The reason is twofold: 1. we're dealing with a simple problem, and 2. I don't want to train for too long, to keep reproduction simple and GPU costs down. I worked backwards from the previous result: since FastAI took roughly one hour, I wanted to stay within the same GPU time budget here.
To train only the classification head, I had to use the PyTorch interface, which allows for more flexibility. First, I download DistilBERT's tokenizer and apply it to our dataset, then download the model itself, mark all base layers as requiring no gradient (i.e. not trainable), and finally train the classification head.
from transformers import AutoTokenizer

batch_size = 32
num_epochs = 3
learning_rate = 3e-5

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
First, I apply the DistilBERT tokenizer to our datasets:
tokenized_train_df = tokenizer(
    text=train_df["clean_name"].tolist(), padding=True, truncation=True
)
tokenized_test_df = tokenizer(
    text=test_df["clean_name"].tolist(), padding=True, truncation=True
)
Now, I create a PyTorch dataset that is able to handle the input format given by the tokenizer (tokens + attention mask), alongside the labels:
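The NamesDataset class lives in the notebook; here is a minimal sketch of what such a dataset needs to do (my version, which may differ from the notebook's in details):

import torch
from torch.utils.data import Dataset

class NamesDataset(Dataset):
    """Wraps the tokenizer output (input_ids + attention_mask) and the labels."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = list(labels)

    def __getitem__(self, idx):
        # Each item is a dict of tensors, which is what the Hugging Face model expects
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)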
from torch.utils.data import DataLoader

train_dataset = NamesDataset(
    tokenized_train_df, (train_df.label == "person").astype(int)
)
test_dataset = NamesDataset(tokenized_test_df, (test_df.label == "person").astype(int))

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
I set all the base model parameters as non-trainable (so that only the classification head is trained):
for param in model.distilbert.parameters():
    param.requires_grad = False
Finally, I actually train the model (see full training and evaluation code in the notebook):
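A minimal sketch of that loop, assuming the standard PyTorch pattern (the notebook version also collects the predicted probabilities used in the error analysis below); only the head's parameters have gradients at this point:

import torch
from torch.optim import AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=learning_rate)

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss   # the model computes the loss when labels are passed
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    model.eval()
    correct = 0
    with torch.no_grad():
        for batch in test_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits
            correct += (logits.argmax(dim=-1) == batch["labels"]).sum().item()
    print(f"epoch {epoch}: test accuracy is {round(100 * correct / len(test_dataset), 3)}%")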
epoch 0: test accuracy is 96.527%
epoch 1: test accuracy is 96.776%
epoch 2: test accuracy is 96.854%
We get 96.9% accuracy, essentially the same as the FastAI LSTM model. This implies the extra complexity here was for nought. Of course, this is a simple problem: on a harder one, I'm sure a stronger pre-trained language model would give an edge over the simpler LSTM trained on Wikipedia. Also, by not fine-tuning the whole network, we miss out on the full power of the transformer (see the sketch below). But this suggests you shouldn't write off FastAI without trying it, which, as shown above, is quite simple.
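For reference, full fine-tuning would just mean leaving the DistilBERT backbone trainable (or un-freezing it), typically with a lower learning rate; a sketch, not something I ran here:

from torch.optim import AdamW

# Un-freeze the DistilBERT backbone to fine-tune the whole network
for param in model.distilbert.parameters():
    param.requires_grad = True

optimizer = AdamW(model.parameters(), lr=2e-5)  # a typical value, not tuned here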
Let’s see which mistakes this model is making:
"proba_person"] = test_preds
test_df["abs_error"] = np.where(
test_df[== "person", 1 - test_df.proba_person, test_df.proba_person
test_df.label
)"abs_error", ascending=False)[
test_df.sort_values("clean_name", "label", "proba_person"]
[10) ].head(
|  | clean_name | label | proba_person |
|---|---|---|---|
| 6192 | best horizon consulting | person | 0.000008 |
| 47006 | rolf schneebiegl & seine original schwarzwald-musi | person | 0.000326 |
| 58512 | development | person | 0.000363 |
| 9404 | xin yuan yao | company | 0.999556 |
| 59585 | cheng hsong | company | 0.999490 |
| 46757 | compagnie lyonnaise du cin ma | person | 0.000550 |
| 38224 | pawel siwczak | company | 0.999389 |
| 25983 | sarah hussain | company | 0.999311 |
| 23870 | manjeet singh | company | 0.999295 |
| 73909 | glassworks | person | 0.000776 |
Again, we see cases of clear mislabeling in the case of person and some tough cases in the case of company. Given the accuracy and the worst mistakes, we may be at the limit of what can be done for this dataset without cleaning it. Now, the final question: Can I get the same level of accuracy without any supervised training at all?
ChatGPT API one-shot learning
I will use OpenAI's API to ask ChatGPT to do the name classification for us. First, I need to define the prompt carefully, a practice now called prompt engineering. There are some rules of thumb for prompt engineering; for example, always give concrete examples before asking ChatGPT to generalize to new ones.
The ChatGPT API has three prompt types:
- System: Helps set the tone of the conversation and gives overall directions
- User: Represents yourself, use it to state your task or need
- Assistant: Represents ChatGPT, use it to give examples of valid or reasonable responses
You can mix and match all prompt types, but I suggest starting with the system one, having at least one round of task-response examples, then restating the task that will actually be completed by ChatGPT.
Here, I ask ChatGPT to classify 10 names at a time into person or company. If I ask for more, say 100 names, there is a higher chance of failure (e.g. it sees a weird string and refuses to handle the whole batch). If a batch still fails to parse, I run a backup query on each name individually. If ChatGPT fails to provide a clear answer for an individual name, I default to “company”, since this class contains more problematic strings.
Finally, how can I extract the labels from ChatGPT's response? It might answer differently from how I asked, for example by fixing a misspelling or using uppercase instead of lowercase (system prompt notwithstanding). It generally answers in the same order as the query, but can I rely on that completely for all 100k examples? To be safe, I do simple string matching based on the Levenshtein distance to match the names I query with ChatGPT's responses.
To reproduce the code below, you need an OpenAI account and OPENAI_API_KEY set in your environment.
= """
system_prompt You are a named entity recognition expert.
You only answer in lowercase.
You only classify names as "company" or "person".
"""
= "Classify the following names into company or person:"
task_prompt
= """google: company
examples_prompt john smith: person
openai: company
pedro tabacof: person"""
= [
base_prompt "role": "system", "content": system_prompt},
{"role": "user", "content": task_prompt},
{"role": "assistant", "content": examples_prompt},
{"role": "user", "content": task_prompt},
{
]
import os

import openai
import Levenshtein as lev  # pip install python-Levenshtein

all_preds = []

def get_chatgpt_preds(batch_df):
    """Gets predictions for a whole batch of names using ChatGPT's API"""
    prompt = base_prompt.copy()
    prompt += [{"role": "user", "content": "\n".join(batch_df.clean_name)}]

    openai.api_key = os.getenv("OPENAI_API_KEY")

    try:
        # Max tokens as 2000 is enough in practice for 10 names plus the prompt
        # Temperature is set to 0 to reduce ChatGPT's "creativity"
        # Model `gpt-3.5-turbo` is the latest ChatGPT model, which is 10x cheaper than GPT3
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo", messages=prompt, max_tokens=2000, temperature=0
        )

        # Since we gave examples as "name: class", ChatGPT almost always follows this pattern in its answers
        text_result = response["choices"][0]["message"]["content"]
        clean_text = [
            line.lower().split(":") for line in text_result.split("\n") if ":" in line
        ]

        # Fallback query: if I cannot find enough names on the response, I ask for each name separately
        # Without it, we'd have parsing failures once every 10 or 20 batches
        if len(clean_text) < len(batch_df):
            clean_text = []
            for _, row in batch_df.iterrows():
                prompt = base_prompt.copy()
                prompt += [{"role": "user", "content": row.clean_name}]

                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo", messages=prompt, max_tokens=2000, temperature=0
                )

                row_response = response["choices"][0]["message"]["content"]
                if ":" in row_response:
                    clean_text.append(
                        [row_response.split(":")[0], row_response.split(":")[-1]]
                    )
                else:
                    clean_text.append([row.clean_name, "company"])  # defaults to company

        # To ensure I'm matching the query and the corresponding answers correctly,
        # I find the closest sentences in the Levenshtein distance sense
        batch_df = batch_df.copy()
        batch_df = batch_df.merge(pd.DataFrame({"resp": clean_text}), how="cross")
        batch_df["resp_name"] = batch_df.resp.str[0].str.strip()
        batch_df["resp_pred"] = batch_df.resp.str[-1].str.strip()
        batch_df["dist"] = batch_df.apply(
            lambda row: lev.distance(row.clean_name, row.resp_name), axis=1
        )
        batch_df["rank"] = batch_df.groupby("clean_name")["dist"].rank(
            method="first", ascending=True
        )
        batch_df = batch_df.query("rank==1.0")[["clean_name", "label", "resp_pred"]]

    # Catches all errors
    # Errors only arise due to failures from the OpenAI API and should be quite rare
    # Ideally, we should keep retrying with exponential backoff
    except Exception as e:
        print("Exception:", str(e))
        batch_df = batch_df.copy()
        batch_df["resp_pred"] = "company"  # defaults to company

    return batch_df
from joblib import Parallel, delayed

chatgpt_num_workers = 32
chatgpt_batch_size = 10
split_size = len(test_df) // chatgpt_batch_size
test_batches = np.array_split(test_df, split_size)

chatgpt_preds = Parallel(n_jobs=chatgpt_num_workers, verbose=5)(
    delayed(get_chatgpt_preds)(batch_df) for batch_df in test_batches
)
chatgpt_preds = pd.concat(chatgpt_preds)
chatgpt_accuracy = (chatgpt_preds.resp_pred == chatgpt_preds.label).sum() / len(
    chatgpt_preds
)

print(f"ChatGPT accuracy is {round(100*chatgpt_accuracy, 2)}%")
ChatGPT accuracy is 97.52%
Incredible! With 97.5% accuracy, ChatGPT managed to outperform complex neural networks trained for this specific task. One explanation is that it used its knowledge of the world to understand some corner cases that the models could not have possibly learned from the training set alone. In some sense, this would be a form of “leakage”: perhaps ChatGPT would be weaker classifying companies founded after its cutoff date (2021).
ChatGPT is also quite cheap to run: classifying the 100k test examples cost just under $5. Scoring them took 18 minutes, which could probably be improved with better parallelism: 100k names in batches of 10 is roughly 10,000 API calls, or about 9 calls per second across the 32 workers. Without any parallelism it would be much slower.
Let’s see the raw responses ChatGPT gives:
chatgpt_preds.resp_pred.value_counts().head(20)
person 50955
company 48913
company or person (not enough context to determine) 13
it is not clear whether it is a company or a person. 11
neither (not a name) 7
not a name 6
it is not clear whether this is a company or a person. 5
cannot be classified as either company or person 4
company or person (not enough information to determine) 4
i'm sorry, i cannot classify this name as it does not appear to be a valid name. 3
neither 3
it is not clear whether it is a person or a company. 2
n/a (not a name) 2
i am sorry, i cannot classify this name as it does not provide enough information to determine if it is a company or a person. 2
this is not a name. 2
this is not a valid name. 2
person (assuming it's a misspelling of a person's name) 2
neither person nor company (not a name) 2
person or company (without more context it is difficult to determine) 2
place 2
Name: resp_pred, dtype: int64
For the vast majority of cases, ChatGPT answers as I request: person or company. In very rare cases, it states it doesn’t know, it’s not clear or could be both. What are such examples in practice?
chatgpt_preds[~chatgpt_preds.resp_pred.isin(["person", "company"])][
    ["clean_name", "label", "resp_pred"]
].head(10)
|  | clean_name | label | resp_pred |
|---|---|---|---|
| 55 | alkj rskolen ringk bing | company | neither (not a valid name) |
| 55 | 81355 | person | cannot be classified without more context |
| 22 | telepathic teddy bear | person | neither |
| 11 | agebim | company | it is not clear whether it is a company or a person. |
| 44 | i quit smoking | company | neither company nor person |
| 33 | saint peters church | company | company (assuming it's a church organization) |
| 88 | holy trinity lutheran church akron oh | company | company (assuming it's a church organization) |
| 55 | displayname | company | company or person (not enough context to determine) |
| 66 | ken katzen fine art | company | company or person (not enough context to determine) |
| 66 | columbus high school | company | company (assuming it's a school) |
The names ChatGPT cannot classify are definitely tricky, like “81355” or “telepathic teddy bear”. In some cases, like for “saint peters church”, it does get it right with some extra commentary in parenthesis. All in all, I’d say ChatGPT did an amazing job and failed in a very human way.
Conclusion
I have explored 4 ways to classify names: from a simple logistic regression to a complex neural network transformer. In the end, a general API from ChatGPT outperformed them all without any proper supervised learning.
| Method | Accuracy (%) |
|---|---|
| Baseline | 89.5 |
| Benchmark | 95 |
| FastAI | 97.1 |
| Hugging Face | 96.9 |
| ChatGPT | 97.5 |
There is a lot of hype around LLMs and ChatGPT, but I’d say it does deserve the attention it’s getting. Those models are transforming tasks that required deep machine learning knowledge into software + prompt engineering problems. As a data scientist, I’m not worried about those models taking over my job, as predictive modelling is only a small aspect of what a data scientist does. For more thoughts on this, check out The Hierarchy of Machine Learning Needs.
Acknowledgements
I’d like to thank Erich Alves for some of the ideas explored in this post. I also thank Erich and Raphael Tamaki for reviewing the post and giving feedback.