🔍Building The Best Emoji Search Engine

Xitang Zhao
Posted on Nov 2, 2024
Not in the mood for reading? Listen to it as a NotebookLM podcast

🔥Motivation

Emojis have become the world’s “fastest-growing language” because they allow us to convey emotions and feelings that are otherwise difficult to express through text alone. Nowadays, over one in five tweets includes an emoji, and emoji usage will continue to grow in the future📈

As an emoji enthusiast, I use emojis daily in messages, emails, and docs to make text more joyful to read and easier on the eyes. However, a problem I frequently run into is not being able to find the right emoji to use. For instance, I like to prefix each section header in a doc with an emoji to better differentiate sections, but existing emoji search tools can’t find matching emojis even for common section headers like “goal”, “problem”, or “solution”. I also like to express kudos to colleagues and would send “Amazing work” followed by an emoji, only to struggle with finding the best emoji that captures “amazing”. These experiences suck. Is there an emoji for “sucks”? There sure is → 🫤, but it has yet to show up in current search tool results.

The limitations of current emoji search tools prevent people from fully expressing their emotions and feelings, and hinder emojis from reaching their maximum impact. As an avid emojier and builder, I believe we can do better, so I set out on a quest to build the best emoji search engine that would allow people to easily find the emojis they want and use emojis joyfully. This post documents some thoughts and lessons learned along the way.

🔍Search Engine

The term "search engine" typically refers to a web search engine, with Google being the world's most popular example. A web search engine takes a user query as input, performs efficient search processing, and returns relevant websites to the user. Similarly, an emoji search engine takes a user query as input, performs searches, and returns relevant emojis.

Google has published a How Search Works page that nicely outlines how a search engine works at a high level. While a web search engine tackles a much larger problem, i.e. searching through ~1 billion websites as opposed to ~2k emojis, the high level ideas pertaining to search and processing are similar and the page serves as a helpful reference and an inspiration for creating an emoji search engine.

At its core, a search engine contains two key components: 1. Keyword database, and 2. Ranking algorithm. In the following sections, I will dive into the technical details of how I built the Emoogle emoji search engine, named after Emoji + Google, with the aim of creating a top notch search experience for emojis.

🗃️1. Keyword Database

The first key component of a search engine is a keyword database that contains numerous keywords linked to the searchable items. Keywords allow searches to run efficiently: the user’s input query is compared against the keywords, and only the items with matching keywords are returned. A web search engine like Google crawls websites and extracts keywords from each site to build its keyword database. When a user enters an input query, the search engine quickly checks the database and returns only websites with matching keywords. For an emoji search engine, the keyword database consists of emoji keywords, which are any words or phrases that match or relate to specific emojis. Unlike websites, where the content typically contains the keywords and the work is to extract them, emojis have no keywords to begin with, and the work is to construct the keywords from scratch.
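To make the idea concrete, here is a minimal sketch of a keyword database as a plain mapping from emoji to keywords, with a lookup that returns only the emojis that have a matching keyword. The `KeywordDatabase` type, the sample entries, and the `findByKeyword` helper are illustrative assumptions, not the actual Emoogle data format.

```typescript
// Hypothetical shape: each emoji maps to its list of keywords (sample data only).
type KeywordDatabase = Record<string, string[]>;

const sampleDb: KeywordDatabase = {
  "🤖": ["robot", "bot", "android", "machine"],
  "🎯": ["target", "goal"],
  "💡": ["light bulb", "idea", "solution"],
};

// Return every emoji that has a keyword identical to the normalized query.
function findByKeyword(db: KeywordDatabase, query: string): string[] {
  const normalized = query.trim().toLowerCase();
  return Object.keys(db).filter((emoji) =>
    db[emoji].some((keyword) => keyword === normalized),
  );
}

console.log(findByKeyword(sampleDb, "goal")); // → ["🎯"]
```

A query with no matching keyword simply returns an empty list, which is the gap the rest of this section tries to close.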

➡️Emoji To Keyword

My first step to seed the Emoogle keyword database was to add all emojis and their names. The emoji’s name is used as the first keyword for each emoji because it effectively captures and conveys the essence of what the emoji is. This was done by pulling from the Unicode emoji-test.txt file, which is an emoji data file published annually by Unicode and contains all the latest emojis and names. At the time of creation, the emoji data file was at version 15.1 and contained 1872 emojis (after deduping skin tone variations). Using the names alone adds 1724 unique words to the keyword database. As a quick note, “unique words” is defined as the number of unique English words in the keyword database and is the primary metric I use to benchmark the keyword database in this post. Generally speaking, the more unique words the database contains, the more powerful it is, as it can match more user input queries/words. (The keyword database created at this step can be found in the JSON file unicode-emoji-keywords.json and the processing script used can be viewed in create-unicode-emoji-keywords.ts)

My second step was to supplement the keyword database with the Unicode Common Locale Data Repository (CLDR)’s emoji keywords list. Unicode CLDR is a project under the Unicode Consortium that contributes greatly to software localization. One of its key works is standardizing the names and keywords for emojis: the group has a well structured process that curates emoji names and keywords through proposals and voting, with help from linguists, localization experts, and community contributions. In fact, the emoji names added in the first step are themselves derived from CLDR. Adding the CLDR keywords in this step boosted the keyword database to 2802 unique keywords. As a fun fact, as far as I can tell, the Windows emoji picker search engine seems to be solely based on CLDR emoji names and keywords. (The keyword database created at this step can be found in the JSON file cldr-emoji-keywords.json and the processing script used can be viewed in create-cldr-emoji-keywords.ts)

My third step was to further supplement the keyword database by pulling additional data from various sources. A non-exhaustive list is shown below:
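As a rough illustration of this first step, the sketch below parses lines laid out like those in emoji-test.txt (code points, status, then `# emoji E<version> name`), skips skin tone variations, and counts unique words in the names. The actual processing lives in create-unicode-emoji-keywords.ts; the regular expression, the "skin tone" name filter, and the helper names here are my own assumptions for illustration.

```typescript
// Sample lines in the layout used by emoji-test.txt (column spacing simplified).
const sampleLines: string[] = [
  "# group: Smileys & Emotion",
  "1F600 ; fully-qualified # 😀 E1.0 grinning face",
  "1F44B ; fully-qualified # 👋 E0.6 waving hand",
  "1F44B 1F3FB ; fully-qualified # 👋🏻 E1.0 waving hand: light skin tone",
  "263A ; unqualified # ☺ E0.6 smiling face",
];

interface EmojiEntry {
  emoji: string;
  name: string;
}

// Keep fully-qualified emojis only and drop skin tone variations (assumed dedupe rule).
function parseEmojiTest(lines: string[]): EmojiEntry[] {
  const pattern = /^[0-9A-F ]+;\s*fully-qualified\s*#\s*(\S+)\s+E\d+\.\d+\s+(.+)$/;
  const entries: EmojiEntry[] = [];
  for (const line of lines) {
    const match = pattern.exec(line);
    if (match === null) continue; // comments, blank lines, other statuses
    const emojiName = match[2].trim();
    if (emojiName.indexOf("skin tone") !== -1) continue;
    entries.push({ emoji: match[1], name: emojiName });
  }
  return entries;
}

// Count distinct lowercase words across all emoji names.
function countUniqueWords(entries: EmojiEntry[]): number {
  const words = new Set<string>();
  for (const entry of entries) {
    for (const word of entry.name.toLowerCase().split(/[^a-z0-9]+/)) {
      if (word.length > 0) words.add(word);
    }
  }
  return words.size;
}

const parsedEntries = parseEmojiTest(sampleLines);
console.log(parsedEntries.length, countUniqueWords(parsedEntries)); // → 2 4
```

On the sample above, only 😀 and 👋 survive, contributing the four unique words "grinning", "face", "waving", and "hand".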

  • emojilib - open source emoji keywords library with 3302 unique keywords (before 4.0 release)
  • gitmoji - open source emoji guide for commit messages & programmers
  • FindReplace.json - leaked Apple emoji keywords (some additional keywords in System/Library/PrivateFrameworks/CoreEmoji.framework were also included)
  • Emojipedia - online emoji encyclopedia to check emoji definitions and usage
  • Urban Dictionary - crowdsourced online dictionary to check how people define and use emojis
  • ChatGPT - LLM chatbot to help brainstorm emoji keywords and symbolic words
  • SERanking - emoji keywords based on people’s search patterns in web search engines
  • Notion emoji keywords
  • Various other sites or reddit posts pertaining to emojis

Going through step three bumped the keyword database to ~4000 unique keywords.

⛓️‍💥Emoji To Keyword Shortcoming

After all this work, however, I would still run into cases from time to time where I couldn’t find the emojis for certain words. Why would this be the case? Before sharing an answer, let’s first discuss the 2 different types of keywords an emoji has:

  1. The first type of keyword is directly related to the emoji, e.g. what the emoji is and its synonyms
  2. The second type of keyword is indirectly related to the emoji, e.g. what the emoji symbolizes or relates to

Steps 1-3 curate keywords by going through each emoji. This is an emoji to keyword approach: it works great at filling out directly related keywords, but falls short for indirectly related keywords. Let’s use the robot emoji 🤖 as an example. It is straightforward to come up with bot, android, machine as its keywords, which are directly related. However, it is rather difficult to come up with indirectly related keywords such as AI, technology, automation, engineering, which are all appropriate as a robot symbolizes all these words in one shape or form.

Going through steps 1-3 still fails to find some emojis because many words we use from day to day don’t directly relate to an emoji and require a bit more thinking and imagination to draw the connection from the word to an emoji. These words are therefore missing from a keyword database built with an emoji to keyword approach.

↩️Keyword to Emoji

Enabling better emoji search for common day to day words and including more indirectly related keywords requires the opposite approach: keyword to emoji, which goes through each word intentionally, finds the best matching emojis for the word, and then adds the word to the keyword database.

Going through each word might sound like a wild idea at first but isn’t too crazy when taking a closer look. There are 170k actively used English words according to the Oxford English Dictionary, but the average person only speaks 7k-20k words according to wordmetrics.org. On the lower end, 7k is ~4 times the number of emojis, so it isn’t as bad as we might think.

Still, it would be quite a lot of manual work for any person to go through thousands of words and assign the best matching emojis. This brought me to step 4, which was to enrich the keyword database with a selective list of the most frequently used words. The list I used was the free top 5000 words of the Word frequency data, which claims to be probably the most accurate word frequency data for English and is based on the one billion word Corpus of Contemporary American English (COCA). I went through the top 1000 words in this list, found the best matching emojis, and added the words to the keyword database to match emojis for day to day words, e.g. goal → 🎯, problem → ❓, solution → 💡, amazing → 🤩, stupid → 🤡, promote → 📣, etc. Mapping a word to an emoji requires some imagination, so I sometimes searched the word in Google Images to help with brainstorming. Additionally, my focus was on nouns, adjectives and verbs while skipping articles and prepositions, as the former are more relevant for emoji keywords. While I was able to match most words to emojis, some words are not really matchable, e.g. happen, general, very, etc. We could probably still match them, but I left them out for now since I don’t think there would be practical use cases for them.
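The keyword to emoji direction can be sketched as a merge: start from a word → emojis list and fold it into the emoji → keywords database. This is only an illustration of the idea under an assumed database shape; `addWordMappings` and the sample data are hypothetical, and the mappings themselves were curated by hand.

```typescript
type KeywordDatabase = Record<string, string[]>;

// Merge word → emojis assignments into an emoji → keywords database.
// Returns how many keywords were newly added (existing ones are skipped).
function addWordMappings(
  db: KeywordDatabase,
  wordToEmojis: Record<string, string[]>,
): number {
  let added = 0;
  for (const word of Object.keys(wordToEmojis)) {
    for (const emoji of wordToEmojis[word]) {
      if (db[emoji] === undefined) db[emoji] = [];
      if (db[emoji].indexOf(word) === -1) {
        db[emoji].push(word);
        added += 1;
      }
    }
  }
  return added;
}

// Sample starting point: "solution" is already present, "goal" and "amazing" are not.
const keywordDb: KeywordDatabase = { "🎯": ["target"], "💡": ["light bulb", "solution"] };
const addedCount = addWordMappings(keywordDb, {
  goal: ["🎯"],
  solution: ["💡"],
  amazing: ["🤩"],
});
console.log(addedCount, keywordDb["🎯"], keywordDb["🤩"]); // → 2 ["target", "goal"] ["amazing"]
```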

As someone who lives and breathes engineering and entrepreneurship, I wanted to make sure the keyword database covers the words people regularly use in these two domains. This led to step 5, which was to boost the keyword database with a selective list of the most frequently used words in engineering and entrepreneurship. I curated this word list by pulling the top 1000 most frequently used words from Martin Kleppmann’s iconic Designing Data-Intensive Applications and Uri Levine’s epic Fall in Love with the Problem, Not the Solution, and then mapped the words to emojis, e.g. database → 🗃️🗄️, replication → 📚🧬, guarantee → 🛡️, users → 👥👨‍👩‍👧‍👦, failure → 🏳️, iteration → 🔁, etc. Many words at the top of the list are articles/prepositions/conjunctions (e.g. the, to, and, etc) and some words had been covered already, so the net number of new words added is smaller than the size of the list. Some fun facts discovered during this analysis: “data” is the most frequently used content word in Martin's book and appears 1950 times, while “users” is the most frequently used content word in Uri's book and appears 460 times. Uri's book itself has ~7k unique words, which coincidentally matches the lower end of the number of words an average person speaks.

Step 6 was continuous improvement. Whenever I came across a new word I needed to emojify in my day to day, I would find the best matching emoji and add the word to the Emoogle keyword database. I have done this for over half a year, and many words have been mapped to their best matching emojis, e.g. pull request → 🔀, unfortunately → 😔, unknown → ❓, ceo → 👑, etc.

As a result of these 6 steps, the Emoogle keyword database now contains a whopping 5424 unique words at the time of writing and can be found in emoogle-emoji-keywords.json, making it, to my knowledge, the largest publicly published emoji keyword database and sufficiently good to find matching emojis for many day to day words.

Still, there is only so much that one person can do. While these steps provide the keyword database with a good starting point, it still has a lot of room for improvement. To make the keyword database truly the best, it would require step 7, which is community contribution to receive more input and diverse usage from the broader community. If you are an emoji enthusiast and passionate about enhancing the emoji search experience, feel free to contribute to the 🐶Emoogle Emoji Keywords Database spreadsheet and add new keywords directly, or submit a GitHub pull request to the Emoogle Emoji Search Engine repo if you are a developer.
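Pulling the most frequently used words out of a text can be done with a simple word count. The sketch below is a generic version of that idea, not the script used for the books above; the tokenizer (lowercase, split on non-letters) and the stop word list are simplifying assumptions.

```typescript
// Count word occurrences and return the `limit` most frequent words,
// skipping any word in `stopWords`. Ties are broken alphabetically.
function topWords(
  text: string,
  limit: number,
  stopWords: string[] = [],
): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const word of text.toLowerCase().split(/[^a-z]+/)) {
    if (word.length === 0 || stopWords.indexOf(word) !== -1) continue;
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit);
}

const sampleText = "Data systems store data. The data is replicated.";
console.log(topWords(sampleText, 2, ["the", "is"])); // → [["data", 3], ["replicated", 1]]
```

The resulting list still needs a human pass to decide which words are worth mapping to emojis.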

🏆2. Ranking Algorithm

Having a vast keyword database gets us halfway toward building the best emoji search engine. The other half and second key component of a search engine is a ranking algorithm that ranks the search results and sorts the most relevant results first. This is important because a keyword match can return many results at once and the ranking algorithm ensures the most relevant ones are sorted to the top. For example, “a” matches various emojis: 😘 (face blowing a kiss), 💠 (diamond with a dot), 🅰️ (A button blood type), 🧮 (abacus), 🔤 (abc). A great ranking algorithm should order 🅰️ first because the letter A emoji matches the input query “a” precisely.

🧠Rule-based Ranking Algorithm

There are different approaches to creating a ranking algorithm. Two common approaches are: 1. rule-based, and 2. score-based. Web search engines like Google use a score-based algorithm that assigns a score to each website based on various attributes, which include keyword or content match, authority or trustworthiness of the website, personalization factors based on the user’s location and past search history, etc. The websites are then returned in order from highest to lowest score. A score-based algorithm is powerful and can handle many attributes simultaneously, but introduces additional complexity that makes the sorted results more difficult to reason about and tweak.

For the purpose of ranking emoji results, a rule-based algorithm is a simpler and clearer alternative: it defines a set of rules that determine the rank and order of an emoji. Algolia, a popular search-as-a-service platform, advocates for this rule-based approach and calls it a tie-breaking algorithm. Here is how it works: when comparing the ranks of two emojis, both are checked against a set of rules. If both meet the first rule, it is a tie. The algorithm then checks the next rule until there is a tie-break, where one emoji meets the rule and the other doesn’t, in which case the emoji that meets the rule ranks higher.
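The tie-breaking idea maps naturally onto a sort comparator. Below is a minimal sketch: rules are checked in order, and the first rule that only one of the two candidates meets decides the order. The `Candidate` shape and the two sample rules are assumptions for illustration, and I also assume that two candidates which both fail a rule are treated as tied on that rule.

```typescript
interface Candidate {
  emoji: string;
  keyword: string;
}

type Rule = (candidate: Candidate, query: string) => boolean;

// Walk the rules in priority order; the first rule met by exactly one
// candidate breaks the tie in that candidate's favor.
function compareByRules(a: Candidate, b: Candidate, query: string, rules: Rule[]): number {
  for (const rule of rules) {
    const aMeets = rule(a, query);
    const bMeets = rule(b, query);
    if (aMeets !== bMeets) return aMeets ? -1 : 1;
  }
  return 0; // tied on every rule: keep the existing order
}

// Two sample rules (hypothetical stand-ins for the real rule set).
const isExactMatch: Rule = (candidate, query) => candidate.keyword === query;
const isSingleWord: Rule = (candidate) => candidate.keyword.indexOf(" ") === -1;

const candidates: Candidate[] = [
  { emoji: "🪖", keyword: "army" },
  { emoji: "🛡️", keyword: "armour" },
  { emoji: "💪", keyword: "arm" },
];
const ranked = [...candidates]
  .sort((a, b) => compareByRules(a, b, "arm", [isExactMatch, isSingleWord]))
  .map((candidate) => candidate.emoji);
console.log(ranked); // → ["💪", "🪖", "🛡️"]
```

Because each rule is a plain predicate, adding, removing, or reordering rules does not require touching the comparator itself.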

📏Ranking Rules

The Emoogle ranking algorithm consists of 10+ rules to handle various cases. In this section, I will focus on the primary use case where the input is a single word and discuss 7 key rules associated with it.

  1. Exact match ranks higher than prefix match

Matching an input word with a keyword results in 3 cases: 1. exact match, 2. prefix match, and 3. no match. No match is straightforward and means that the emoji keyword doesn’t match the input and therefore shouldn’t appear in the results. Exact match means the input word is identical to the keyword, while prefix match means the input word only matches the beginning of the keyword. For example, the input word “arm” matches the following 3 emojis: 💪 (arm), 🛡️ (armour), 🪖 (army). 💪 (arm) is an exact match, while 🛡️ (armour) and 🪖 (army) are prefix matches. 🕹️ (arcade) would be a no match, because arcade doesn’t start with “arm”. The first rule is a reasonable rule that ranks the exact match higher than prefix match, because the exact match better aligns with user’s input.
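The three match cases can be expressed as a small classifier. This is a sketch of the definition above rather than the engine's actual code; `classifyMatch` is a hypothetical helper, and treating an empty input as "none" is my assumption.

```typescript
type MatchType = "exact" | "prefix" | "none";

// Classify how an input word relates to a single keyword.
function classifyMatch(input: string, keyword: string): MatchType {
  const query = input.trim().toLowerCase();
  const candidate = keyword.toLowerCase();
  if (query.length === 0) return "none"; // assumption: empty input matches nothing
  if (candidate === query) return "exact";
  return candidate.startsWith(query) ? "prefix" : "none";
}

console.log(classifyMatch("arm", "arm"));    // → "exact"
console.log(classifyMatch("arm", "armour")); // → "prefix"
console.log(classifyMatch("arm", "arcade")); // → "none"
```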

  2. Default most relevant emoji ranks higher

An input word sometimes has multiple exact matches. For example, "clean" exactly matches the keywords for these emojis: 🛀, 🚿, 🧹, 🧼. These emojis are returned in this default order as it is the order of the Unicode data set. However, while 🛀 and 🚿 relate to showering and cleaning oneself, "clean" more commonly refers to cleaning rooms and spaces. Thus, 🧹 is the most relevant match and ranks higher. How this works in practice is that I have curated a list of default most relevant emoji for some keywords (500+ at the time of writing), which gives an emoji a higher rank for keyword match, e.g. a → 🅰️, accessibility → ♿, address → 📍, buy → 🛒, fast → ⚡, travel → ✈️. This list is curated based on experience and usage of what is more relevant. It can be found at emoogle-keyword-most-relevant-emoji.json and is open to community contribution to expand and improve upon.

  3. User preferred most relevant emoji ranks higher

This rule is similar to the previous one and allows user preference to take precedence over the default most relevant emoji to personalize search results. Using the same example of the input word “clean”, say a user prefers using the 🧼 emoji to denote it: 🧼 would then rank first in the results. User preference has to be provided to the search engine for this to work correctly. In the Emoogle desktop app, the app locally stores the emoji that the user has selected for a given input word and feeds this preference data into the search engine to enhance the search experience.
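The interplay between the default and the user preferred emoji can be sketched as a lookup with a fallback. The `Preferences` type and `promotePreferred` helper below are hypothetical; the sample defaults come from the examples in this post, while the real list lives in emoogle-keyword-most-relevant-emoji.json.

```typescript
type Preferences = Record<string, string | undefined>;

// A few default most relevant emojis taken from the examples in this post.
const defaultMostRelevant: Preferences = { clean: "🧹", fast: "⚡", buy: "🛒" };

// Move the preferred emoji for `keyword` to the front of `results`.
// A user preference, when present, wins over the curated default.
function promotePreferred(
  results: string[],
  keyword: string,
  userPreferred: Preferences = {},
): string[] {
  const preferred = userPreferred[keyword] ?? defaultMostRelevant[keyword];
  if (preferred === undefined || results.indexOf(preferred) === -1) return results;
  return [preferred, ...results.filter((emoji) => emoji !== preferred)];
}

const cleanMatches = ["🛀", "🚿", "🧹", "🧼"];
console.log(promotePreferred(cleanMatches, "clean"));                  // → ["🧹", "🛀", "🚿", "🧼"]
console.log(promotePreferred(cleanMatches, "clean", { clean: "🧼" })); // → ["🧼", "🛀", "🚿", "🧹"]
```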

  4. Keyword in emoji name ranks higher

The Emoogle emoji keyword database uses the emoji name as one of an emoji’s keywords. This rule ranks a keyword higher if the keyword is in the emoji name, as the emoji name generally better captures the word. For example, the input “1” matches 1️⃣, ☝️, 🕐, and 1️⃣ ranks first because its emoji name “keycap: 1” contains “1”.

  5. Single word keyword ranks higher than phrase keyword with multiple words

An emoji keyword can be a single word or a phrase with multiple words. For example, 🔖 has “bookmark” as its single word keyword, while 📑 has “bookmark tabs” as its phrase keyword. This rule ranks a single word keyword match higher than a phrase keyword match, as a single word keyword is more relevant. So if the user searches for “bookmark”, 🔖 ranks higher than 📑 because the latter has more words.

The above 5 rules apply to exact match. There are 2 additional rules that apply to prefix match only.

  6. Keyword that is more commonly used ranks higher

Prefix match can potentially match many keywords. For example, “h” matches 💧(h2o), 😆(haha), 💇(haircut) etc, where the default order is alphabetical. This rule uses a list of the top 1000 words to order keywords based on their frequency of use, so searching “h” returns 🤝(help), 📈(high), 🏠(home) etc to increase relevancy, as "help", "high", "home" are the most used words. The top 1000 word list is created from the free top 5000 words of the Word frequency data, modified to only include emoji matchable nouns, adjectives and verbs. The file can be viewed in top-1000-words-by-frequency.json.

  7. Keyword that is recently searched ranks higher

This rule is similar to the previous one and allows the user's recent usage to take precedence over common usage to personalize search results. Using the same example of the input “h”, say a user recently searched for and used “hello” 👋: “h” would then return 👋(hello) first, followed by 🤝(help), 📈(high), 🏠(home).

These 7 rules are the primary rules and strategies used to rank emojis by using a combination of match type, most common/relevant usage, personalization, and keyword type. There are additional rules that handle various other cases, and more details can be found in the codebase. If the input query is a phrase with multiple words, different strategies are used to compute matches, e.g. checking against phrase keywords and checking against all keywords for possible matches.
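The two prefix match rules can be sketched as a single ordering function: recently searched keywords come first, then keywords from the frequency list, then everything else alphabetically. The helper name, the tiny frequency list, and the "most recent first" layout of `recentSearches` are assumptions made for this example.

```typescript
// Hypothetical frequency list, most common first (the real list has 1000 words).
const wordsByFrequency: string[] = ["help", "high", "home"];

// Order prefix-matched keywords: recent searches, then frequent words, then the rest.
function orderPrefixMatches(keywords: string[], recentSearches: string[] = []): string[] {
  const rank = (keyword: string): number => {
    const recentIndex = recentSearches.indexOf(keyword);
    // Recent searches get negative ranks so they sort ahead of everything else.
    if (recentIndex !== -1) return recentIndex - recentSearches.length;
    const frequencyIndex = wordsByFrequency.indexOf(keyword);
    return frequencyIndex !== -1 ? frequencyIndex : wordsByFrequency.length;
  };
  // Fall back to alphabetical order when two keywords share the same rank.
  return [...keywords].sort((a, b) => rank(a) - rank(b) || a.localeCompare(b));
}

const hKeywords = ["h2o", "haha", "haircut", "hello", "help", "high", "home"];
console.log(orderPrefixMatches(hKeywords).slice(0, 3));            // → ["help", "high", "home"]
console.log(orderPrefixMatches(hKeywords, ["hello"]).slice(0, 4)); // → ["hello", "help", "high", "home"]
```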

🐶 Emoogle Emoji Search Engine

The vast keyword database combined with a thoughtfully crafted ranking algorithm gives birth to the Emoogle emoji search engine. When given an input query/word, the search engine first loops through all emojis and their keywords in the keyword database to search for matching emojis and then sorts the matching emojis using the ranking algorithm before returning the results - all of which happens blazingly fast, taking ~10ms on average.
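Put together, the flow described above is a linear scan followed by a sort. The sketch below is a deliberately simplified version with only one ranking rule (exact before prefix) and an assumed database shape; it is not the actual Emoogle implementation, which applies many more rules.

```typescript
type KeywordDatabase = Record<string, string[]>;

interface SearchResult {
  emoji: string;
  keyword: string;
  exact: boolean;
}

// Scan every emoji's keywords, keep the best match per emoji, then rank.
function searchEmojis(db: KeywordDatabase, input: string): SearchResult[] {
  const query = input.trim().toLowerCase();
  if (query.length === 0) return [];
  const results: SearchResult[] = [];
  for (const emoji of Object.keys(db)) {
    let best: SearchResult | undefined;
    for (const keyword of db[emoji]) {
      if (keyword === query) {
        best = { emoji, keyword, exact: true };
        break; // nothing beats an exact match
      }
      if (best === undefined && keyword.startsWith(query)) {
        best = { emoji, keyword, exact: false };
      }
    }
    if (best !== undefined) results.push(best);
  }
  // Single sample rule: exact matches sort ahead of prefix matches.
  return results.sort((a, b) => Number(b.exact) - Number(a.exact));
}

// Sample data only; keywords are assumed to be stored in lowercase.
const armDb: KeywordDatabase = {
  "🛡️": ["shield", "armour"],
  "💪": ["flexed biceps", "arm"],
  "🪖": ["military helmet", "army"],
  "🕹️": ["joystick", "arcade"],
};
console.log(searchEmojis(armDb, "arm").map((result) => result.emoji)); // → ["💪", "🛡️", "🪖"]
```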

The Emoogle emoji search engine supports 2 use cases:

  1. Search emojis as you type
  2. Search best matching emojis for a sentence

Both use cases are similar, with one key difference. The first requires a match for all input words, so the emoji results narrow as the user types. The second loosens this requirement and also matches stemmed words, since its goal is to find the best matching emojis for the sentence.

🆚Algorithmic Approach vs Machine Learning Approach

The Emoogle emoji search engine was created in 2024, a year or so after Large Language Models (LLMs) went viral in late 2022. Despite this, the approach I have chosen to build it is rather manual, traditional, and algorithmic. You might wonder: could this emoji search engine be built using ML or an LLM instead, and what difference would that make? The answer is certainly yes, as any problem can be solved using various approaches, each with its own trade-offs. Our quest to build the best emoji search engine wouldn’t be complete if we fixated on one approach without considering others. So let’s examine an ML approach before we conclude.

A possible ML approach to create an emoji search engine would be using text embeddings, which convert text into a vector of numbers. In other words, a text embedding is a numerical representation of text as a sequence of numbers. In an embedding vector space, semantically similar words or sentences are closer in vector distance. For example, “smile” is closer to “happy” than to “sad” in the vector space. To make this easier to conceptualize, imagine there is a line where “happy” is a dot on the left end and “sad” is on the right: “smile” would be a dot in between but closer to “happy”. While a line is only one dimension, a vector space contains hundreds or thousands of dimensions, but the idea is similar.

Leveraging the fact that semantically similar words are closer in vector distance, we can first precalculate the embeddings for all emojis using the emojis and their names to obtain the emoji vectors. Next, the search engine can convert the input query to an embedding vector, find the top closest emoji vectors to this input query vector, and output the corresponding emojis as results. This approach has been implemented on emojisearch.app by Lilian Weng using OpenAI’s text-embedding-ada-002 model. While the model she used is closed source and costs money to use, it’s possible to create a free version using an open source model such as nomic-embed-text.
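The nearest vector lookup itself is short. The sketch below uses cosine similarity as the closeness measure (one common choice, assumed here) and made-up 3-dimensional vectors in place of real model output; in practice the vectors would come from an embedding model and have hundreds or thousands of dimensions.

```typescript
// Cosine similarity: 1 means same direction, -1 means opposite direction.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmojiVector {
  emoji: string;
  vector: number[];
}

// Return the `topK` emojis whose vectors are most similar to the query vector.
function nearestEmojis(query: number[], emojiVectors: EmojiVector[], topK: number): string[] {
  return emojiVectors
    .map((entry) => ({ emoji: entry.emoji, score: cosineSimilarity(query, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((entry) => entry.emoji);
}

// Toy vectors invented for illustration; they are not real embeddings.
const toyEmojiVectors: EmojiVector[] = [
  { emoji: "😢", vector: [-0.8, 0.2, 0.1] },
  { emoji: "🌧️", vector: [-0.2, 0.1, 0.9] },
  { emoji: "😀", vector: [0.9, 0.1, 0.0] },
];
const toyHappyQuery = [1, 0, 0]; // stand-in for the embedding of "happy"
console.log(nearestEmojis(toyHappyQuery, toyEmojiVectors, 1)); // → ["😀"]
```

Note that every query has to be compared against every emoji vector, which is where the extra compute cost discussed below comes from.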

Now, let's compare the advantages and disadvantages between the 2 approaches:

🧠Algorithmic Approach Advantages

There are 3 areas where the algorithmic approach outperforms the text embedding approach

  1. Smaller size and faster speed

The algorithmic search is backed by a lightweight emoji keyword database of only ~60KB (gzipped). It runs blazingly fast and takes ~10ms on average to loop through 12k+ keywords. In contrast, text embedding search uses a trained model that is a few hundred MB in size, over 1000 times larger than the emoji keyword database. It is also computationally more expensive, taking at least 10 times longer to run. This is due to the need to compute vector distances for ~2k emoji vectors, each with 1-2k dimensions depending on the model, which means 2-4 million values to process.

  2. Better clarity and customizability

The algorithmic search uses a set of fixed rules to search and rank emojis. This approach yields clear outputs that are easy to debug if issues arise. Moreover, it's straightforward to customize the search by adding keywords or adjusting ranking orders. In contrast, a text embedding model is a black box that is very hard to debug or decipher. Adding keywords is difficult and requires fine-tuning the model.

  3. Well suited for "search emojis as you type" use case

The algorithmic search, being keyword-based, can match input query prefixes and display relevant results as soon as the user starts typing. Moreover, it can be configured with autocomplete to offer a smooth and responsive search experience. In contrast, text embedding search doesn't match prefixes and requires users to type out full words before finding relevant emojis.

🤖Text Embedding Approach Advantages

However, there are 3 areas where the algorithmic approach lags behind and the text embedding approach excels.

  1. Language agnostic

A major limitation of the algorithmic approach is that it is specifically designed for the English language. It doesn't work for other languages without additional effort to add language specific keywords. In contrast, the text embedding approach doesn't have this limitation if the model is trained with multilingual documents.

  2. Breadth of words

Another limitation of the algorithmic approach is that search only works if the input query word is in the keyword database. While the keyword database has been optimized for common everyday word use, it might not work for less frequently used words. In contrast, the text embedding approach is trained on vast amounts of documents and thus can recognize many more words.

  3. Understand phrases and context

The algorithmic approach works well for a single word but has limitations on a phrase or a sentence containing multiple words, because it lacks the ability to understand the relationship between words in context. For example, when searching for the best matching emoji for “raining dogs and cats”, the algorithmic approach returns the Paw Prints emoji 🐾 as the top result. This is because 🐾 has “dog” and “cat” as keywords and the algorithm ranks emojis with more keyword matches higher. In contrast, the text embedding approach is able to understand that the phrase figuratively refers to heavy rain and can potentially return the Cloud with Rain 🌧️ emoji (though in practice, emojisearch.app returns ☂️ as the first result).
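The 🐾 result is easy to reproduce with a toy version of "more keyword matches ranks higher". In the sketch below, `crudeStem` is a naive suffix stripper standing in for whatever stemming the engine actually uses, and the database entries are sample data, so treat it as an illustration of the failure mode rather than the real sentence search.

```typescript
type KeywordDatabase = Record<string, string[]>;

// Naive stand-in for a real stemmer: strips a trailing "ing" or "s".
function crudeStem(word: string): string {
  if (word.length > 5 && word.endsWith("ing")) return word.slice(0, -3);
  if (word.length > 3 && word.endsWith("s")) return word.slice(0, -1);
  return word;
}

// Rank emojis by how many of their keywords appear among the sentence's stemmed words.
function rankByKeywordHits(
  db: KeywordDatabase,
  sentence: string,
): Array<{ emoji: string; hits: number }> {
  const stems = sentence
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((word) => word.length > 0)
    .map(crudeStem);
  return Object.keys(db)
    .map((emoji) => ({
      emoji,
      hits: db[emoji].filter((keyword) => stems.indexOf(keyword) !== -1).length,
    }))
    .filter((result) => result.hits > 0)
    .sort((a, b) => b.hits - a.hits);
}

// Sample entries only.
const petDb: KeywordDatabase = {
  "🌧️": ["cloud with rain", "rain"],
  "🐶": ["dog face", "dog"],
  "🐱": ["cat face", "cat"],
  "🐾": ["paw prints", "dog", "cat"],
};
const phraseRanking = rankByKeywordHits(petDb, "raining dogs and cats");
console.log(phraseRanking[0]); // → { emoji: "🐾", hits: 2 }
```

Each word is matched in isolation, so nothing in this scheme can notice that the phrase as a whole is about rain.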

♾️The end is the beginning

Having walked through the creation journey of the Emoogle emoji search engine and compared it with an ML approach, I'm excited to mark the end of this quest. I've fulfilled my initial goal: building the best emoji search engine that allows people to easily find the emojis they want and use emojis joyfully. The Emoogle emoji search engine is the best in terms of speed, size, customizability, and search-as-you-type experience for the English language. However, this is just the beginning, as it still has limitations around the breadth of its keyword database and its ability to understand phrases and context. The Emoogle emoji search engine and its keyword database are open source and welcome contributions to enhance the engine and expand the keyword database over time. As ML/LLM continues to evolve with smaller and more capable models, it may make sense to incorporate ML into the search engine in the future. Until then, happy emoji-ing 😆

😉If you enjoyed reading this post, be sure to check out the Emoogle desktop app for the best emoji experience for free.