Some examples of using the Elasticsearch Python client for searching

Miranda
3 min read · May 19, 2021


Recently I have been working on data analysis of paper citations. The dataset contains a huge amount of data, and Elasticsearch is a good tool for searching and processing these different types of data. I found there were very few articles about how to use the Elasticsearch API from Python, so I want to share some examples and take notes on my recent research.

The first thing is installation. The documentation is a good tutorial to start with. But don't forget to download and install Elasticsearch itself; keep in mind that without it, the Python client alone will not work.

After installing both successfully, you need to run Elasticsearch. Different systems have different ways to start it. On macOS you can type elasticsearch in Terminal; on other platforms you might need to configure Elasticsearch (see the doc) and start it from the command line with ./bin/elasticsearch.

Then try the following code. If it doesn't raise any error, congratulations, you have connected successfully!

from elasticsearch import Elasticsearch
es = Elasticsearch("......") # Your server name
print(es.info())

In my case, search is the main thing, and I found the most challenging part is writing the query. The doc gives examples of different kinds of search, including filtering, matching, and analysis. ES supports several types of input query: a word, a sentence, or even full text. You can find some details here.

Because I focus on sentence analysis, I mostly use full-text methods for searching. I know ES is very confusing when you first use it, especially when you need to search under certain conditions, which means writing a complex query. From my experience, reading other people's scripts/queries is a good way to deepen your understanding of this. (The ES doc is also a good reference.) And this is one of the purposes of these notes.

Here are some of my examples:

sent = 'Genetic mutation of IFT proteins and/or the molecular motors that drive this process disrupt cilia structure and function and are responsible for a group of related disorders termed ciliopathies 14.'
body = {
    "from": 0,  # start from the top result
    "size": 100,  # the number of results I need
    "query": {  # query
        "multi_match": {  # search in different fields
            "query": sent,  # the sentence
            "analyzer": "standard",  # you can choose another analyzer
            "fields": ['title^2', 'summary', 'text']  # the fields I focus on
        }
    }
}

result = es.search(index="xxxxx", body=body)  # the index you need to set up

dataset = []
for cand in result['hits']['hits']:
    dataset.append(cand['_source'])  # dict

In particular, the details you need will always be stored in result['hits']['hits'].
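For instance, the hits structure can be unpacked like this. This is a minimal sketch: the result dict below just mimics the shape of an Elasticsearch response (the field names and scores are made up), so with a live client you would use the es.search() result instead.

```python
# Sketch of unpacking a search response; `result` imitates the
# shape Elasticsearch returns (hypothetical documents and scores).
result = {
    "hits": {
        "hits": [
            {"_id": "medline_1", "_score": 7.2,
             "_source": {"title": "Cilia structure", "summary": "..."}},
            {"_id": "medline_2", "_score": 5.9,
             "_source": {"title": "IFT proteins", "summary": "..."}},
        ],
    }
}

# _source holds the stored document, _score its relevance score
dataset = [cand["_source"] for cand in result["hits"]["hits"]]
scores = [cand["_score"] for cand in result["hits"]["hits"]]
print(dataset[0]["title"])  # Cilia structure
print(scores)               # [7.2, 5.9]
```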

test = {'query':{'match':{'id': {'query': 'keyword'}}}}

The above query does a match. It works on keywords you want to match. (id is the field name.)
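A couple of variations can be wrapped in small helpers. This is just a sketch that builds the query dicts (the field name id follows the example above; match_phrase is a standard ES query type that requires the terms to appear in order); you would pass the result as body to es.search:

```python
# Sketch: build common single-field query bodies as plain dicts.
def match_query(field, keyword):
    # match: analyzed match on one field
    return {"query": {"match": {field: {"query": keyword}}}}

def match_phrase_query(field, phrase):
    # match_phrase: terms must appear together, in order
    return {"query": {"match_phrase": {field: phrase}}}

test = match_query("id", "keyword")
print(test)  # {'query': {'match': {'id': {'query': 'keyword'}}}}
```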

test2 = {"query": {"regexp": {"id": "medline_.*"}}}

You can also use a regular expression with regexp to search for items. (id is the field name.)

def count_docs():
    body = {
        "query": {
            "regexp": {"id": "medline_.*"}
        }
    }
    res = es.count(index='xxxxxxx', body=body)
    print(res)
Response:
{'count': 23414332, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}}

count (see count) returns the number of documents matching a query.
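The count response can be unpacked like this. A small sketch: the res dict simply copies the response shown above, so with a live client you would call es.count() instead.

```python
# Sketch: pull the document count out of an es.count() response.
# `res` copies the response printed above.
res = {'count': 23414332,
       '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}}

n_docs = res['count']
all_shards_ok = res['_shards']['failed'] == 0  # every shard answered
print(n_docs, all_shards_ok)  # 23414332 True
```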

import nltk

def get_tfidf(sent):
    body = {
        "doc": {
            "summary": nltk.word_tokenize(sent),
            "analyzer": "search_quote_analyzer"
        },
        "term_statistics": True,
        "field_statistics": True
    }
    res = es.termvectors(index='xxxxx', doc_type='documents', body=body)
    return res['term_vectors']['summary']

res = get_tfidf(sent)

termvectors (see termvectors) returns information and statistics about terms in the fields of a particular document. It is a good way to get doc_freq so that you can compute other features like tf-idf. (summary is the field name.)
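Given the per-term statistics in that response, tf-idf can be computed roughly like this. This is only a sketch: tv imitates a simplified res['term_vectors']['summary'], the numbers are invented, and doc_count stands in for the total number of documents in the index.

```python
import math

# Sketch: compute tf-idf weights from a termvectors-style response.
# `tv` imitates res['term_vectors']['summary'] with made-up numbers.
tv = {
    "field_statistics": {"doc_count": 100, "sum_doc_freq": 500,
                         "sum_ttf": 600},
    "terms": {
        "cilia":   {"doc_freq": 5,  "ttf": 7,  "term_freq": 2},
        "protein": {"doc_freq": 50, "ttf": 80, "term_freq": 1},
    },
}

n_docs = tv["field_statistics"]["doc_count"]
tfidf = {
    term: stats["term_freq"] * math.log(n_docs / stats["doc_freq"])
    for term, stats in tv["terms"].items()
}
# the rarer term gets the higher weight
print(tfidf["cilia"] > tfidf["protein"])  # True
```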

body = {
    "from": 0,
    "size": candid_num,
    "query": {
        "bool": {
            "must": {"regexp": {"id": "medline_.*"}},
            "should": {
                "multi_match": {
                    "query": sent,
                    "analyzer": "standard",
                    # "type": "most_fields",
                    "fields": ["title^2", "summary", "text"]
                }
            }
        }
    }
}

The above query is more complex since it combines conditions. bool is built from one or more boolean clauses, each with a typed occurrence (must, should, must_not, filter). It is a good way to do filtering and keep only the results you want.
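The pattern can be wrapped in a small helper. A sketch following the field names used above: must acts as a mandatory condition, while should boosts the relevance of documents matching the multi_match.

```python
# Sketch: build a bool query combining a mandatory regexp condition
# with an optional multi_match boost (field names as in the examples).
def build_bool_query(sent, id_pattern, size=10):
    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                "must": {"regexp": {"id": id_pattern}},
                "should": {
                    "multi_match": {
                        "query": sent,
                        "analyzer": "standard",
                        "fields": ["title^2", "summary", "text"],
                    }
                },
            }
        },
    }

body = build_bool_query("cilia structure", "medline_.*", size=100)
print(body["size"])                   # 100
print(body["query"]["bool"]["must"])  # {'regexp': {'id': 'medline_.*'}}
```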

Keep in mind that results are always returned in descending order of their (relevance) score. This is my first attempt at writing a blog to share my experience, and I will try to keep working and continue my efforts. I hope you enjoy this article.
