Building a Simple KvasirA Integration

Published 29 Nov 2019 by Kvasir Analytics

Accessing the KvasirA API from Python


Previously, we gave a quick tour of the KvasirA API and how to use it. In this post, we’ll show how the API can be used to build a simple command-line client for KvasirA in Python 3.

Before we begin, you may need to install the requests library, like so:

pip3 install requests

Listing available document libraries

As a quick reminder, to get the list of available public document libraries, we need to perform an HTTP GET request to the API using the address

https://demo.kvasira.com/api/libraries/

In Python, this request is simple to do using the requests library:

import json
import requests

api_url = 'https://demo.kvasira.com/api/libraries'
r = requests.get(api_url, headers={
    'Content-Type': 'application/json',
})

# Decode the JSON body of the reply into a Python dictionary
response = json.loads(r.text)

A successful request returns a status code of 200. In that case, printing the running document libraries in alphabetical order takes only a few lines:

from operator import itemgetter

libraries = response['data'] if r.status_code == 200 else []

for library in sorted(libraries, key=itemgetter('title')):
    title, library_id, running, description = \
      library['title'], library['id'], library['running'], library['description']
    if running:
        print(f'{title} ({library_id}) - {description}')

The output should look as follows:

ArXiv (arxiv) - ArXiv papers
Hansard (hansard) - UK House of Commons speeches 1979-2018
Patents (patents) - USPTO patents - data provided by PatentsView
RFC (rfc) - Internet RFCs and drafts
Twitter (twitter) - A random sample of tweets from 2017-2018
Wiki (AR) (arwiki) - Arabic Wikipedia articles
Wiki (EN) (enwiki) - English Wikipedia articles
Wiki (ES) (eswiki) - Spanish Wikipedia articles
Wiki (HI) (hiwiki) - Hindi Wikipedia articles
YouTube (youtube) - A sample library of 90 000 YouTube videos
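
The filter-and-sort step above does not depend on the network, so it can be pulled into a small function and tried on a hand-written payload. The following is a minimal sketch: format_libraries is a hypothetical helper name, and the sample data is made up, mirroring only the fields (title, id, running, description) used in the snippet above.

```python
from operator import itemgetter


def format_libraries(libraries):
    """Return one display line per running library, sorted by title."""
    return [
        f"{lib['title']} ({lib['id']}) - {lib['description']}"
        for lib in sorted(libraries, key=itemgetter('title'))
        if lib['running']
    ]


# Made-up sample data for illustration; this is not actual API output.
sample = [
    {'title': 'RFC', 'id': 'rfc', 'running': True,
     'description': 'Internet RFCs and drafts'},
    {'title': 'ArXiv', 'id': 'arxiv', 'running': True,
     'description': 'ArXiv papers'},
    {'title': 'Offline', 'id': 'offline', 'running': False,
     'description': 'A library that is not running'},
]

for line in format_libraries(sample):
    print(line)
```

Keeping the formatting apart from the request makes it possible to check the sorting and the running filter without access to the demo server.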


Querying a document library

To query a document library, we need to issue a POST request to

https://demo.kvasira.com/api/library/LIBRARY_ID/query?query_type=[url|text]&k=N,

where LIBRARY_ID is the id of the document library we want to query, N is a parameter that indicates the desired number of results, and the query_type parameter specifies the query type. In Python, this request can be done as follows:

target_url = 'https://en.wikipedia.org/wiki/Merge_sort'
library_id = 'enwiki'
k = 6

call_url = f'https://demo.kvasira.com/api/library/{library_id}/query?query_type=url&k={k}'
# The document to query with is sent in the JSON body under the 'doc' key
r = requests.post(call_url, data=json.dumps({'doc': target_url}), headers={
    'Content-Type': 'application/json',
})
response = json.loads(r.text)
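
The query string can also be assembled with the standard library rather than by hand, which keeps the parameters escaped and makes it easy to reject an unsupported query_type before any request is sent. This is a minimal sketch, assuming a hypothetical helper called build_query_url; only the endpoint pattern and the two query types come from the description above, and the lower bound on k is an assumption.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://demo.kvasira.com/api/'
QUERY_TYPES = ('url', 'text')  # the two values listed in the endpoint pattern


def build_query_url(library_id, query_type='url', k=10):
    """Build the query endpoint URL for a library (hypothetical helper)."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f'query_type must be one of {QUERY_TYPES}')
    if k < 1:
        # Assumption: a request for fewer than one result is never meaningful.
        raise ValueError('k must be a positive integer')
    params = urlencode({'query_type': query_type, 'k': k})
    # quote() escapes characters that are not safe in a URL path segment.
    return f"{BASE_URL}library/{quote(library_id, safe='')}/query?{params}"


print(build_query_url('enwiki', 'url', 6))
```

The returned string can be passed to requests.post in place of a hand-built call_url.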

Looping over the results and printing them is easy:

if r.status_code == 200:
  for i, result in enumerate(response['response']['results'], start=1):
    title, url = result['title'], result['uri']
    print(f'{i}. {title} - {url}')

This will give us neatly formatted query results, for example:

1. Merge sort - https://en.wikipedia.org/wiki/Merge_sort
2. Sorting algorithm - https://en.wikipedia.org/wiki/Sorting_algorithm
3. Merge algorithm - https://en.wikipedia.org/wiki/Merge_algorithm
4. Insertion sort - https://en.wikipedia.org/wiki/Insertion_sort
5. Quicksort - https://en.wikipedia.org/wiki/Quicksort
6. External sorting - https://en.wikipedia.org/wiki/External_sorting
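
A reply does not always carry a result list: in the complete script further down, the 'response' field of a failed query is treated as an error message rather than as a dictionary. The sketch below shows one way to format results defensively. The helper name format_results and the '(untitled)' placeholder are assumptions made for illustration, and the sample payload is hand-written to mirror the shape used above.

```python
def format_results(payload):
    """Turn a decoded query reply into numbered display lines."""
    response = payload.get('response')
    # On failure 'response' may be missing or hold an error message instead.
    if not isinstance(response, dict):
        return []
    return [
        f"{i}. {result.get('title', '(untitled)')} - {result.get('uri', '')}"
        for i, result in enumerate(response.get('results', []), start=1)
    ]


# Hand-written sample mirroring the response shape; not actual API output.
sample = {'response': {'results': [
    {'title': 'Merge sort',
     'uri': 'https://en.wikipedia.org/wiki/Merge_sort'},
    {'title': 'Sorting algorithm',
     'uri': 'https://en.wikipedia.org/wiki/Sorting_algorithm'},
]}}

for line in format_results(sample):
    print(line)
```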


The full story

A complete command-line application with neater output, error handling and argument parsing is given below. It requires the requests and colorama libraries, which can be installed with pip3 install requests colorama.

import argparse
import json
import requests
import sys
from operator import itemgetter
from colorama import Fore, Style

BASE_URL = 'https://demo.kvasira.com/api/'

def print_query_error(message):
    print(f'{Fore.RED}Query failed: {message}{Style.RESET_ALL}',
          file=sys.stderr)

def print_query_success(response, print_summary):
    results = response['results']
    for i, result in enumerate(results, start=1):
        title, url = result['title'], result['uri']
        if print_summary:
            print(Fore.GREEN, end='')
        print(f'{i}. {title} - {url}{Style.RESET_ALL}')
        if print_summary:
            print(result['summary'])
            if i != len(results):
                print()

def print_libraries(libraries):
    for lib in sorted(libraries, key=itemgetter('title')):
        title, library_id, running, description = \
                lib['title'], lib['id'], lib['running'], lib['description']
        color = Fore.GREEN if running else Fore.RED
        print(f'{color}{title} ({library_id}) - {description}{Style.RESET_ALL}')

def get_libraries():
    call_url = BASE_URL + 'libraries'
    try:
        response = requests.get(call_url, headers={
            'Content-Type': 'application/json',
        })
        return json.loads(response.text), response.status_code
    except (requests.RequestException, ValueError):
        # Network failure, or a reply that is not valid JSON
        return None, None

def query(library_id, url, k):
    call_url = BASE_URL + f'library/{library_id}/query?query_type=url&k={k}'
    try:
        response = requests.post(
            call_url, data=json.dumps({'doc': url}),
            headers={
                'Content-Type': 'application/json',
            })
        return json.loads(response.text), response.status_code
    except (requests.RequestException, ValueError):
        # Network failure, or a reply that is not valid JSON
        return None, None

def check_valid_n(value):
    ivalue = int(value)
    if not 1 <= ivalue <= 20:
        raise argparse.ArgumentTypeError(
            f'{value} is invalid -- must be within 1..20')
    return ivalue

def parse_arguments():
    parser = argparse.ArgumentParser(description='KvasirA query tool.')
    parser.add_argument('library', help='the document collection to query')
    parser.add_argument('-u', '--url', help='the URL to query', default='')
    parser.add_argument('-n', '--nresults', type=check_valid_n, default=10,
                        help='number of results')
    parser.add_argument('-s', '--summary', dest='summary', action='store_true',
                        help='display summaries (default)')
    parser.add_argument('--no-summary', dest='summary', action='store_false',
                        help='do not display summaries')
    parser.set_defaults(summary=True)
    return parser.parse_args()

def main():
    args = parse_arguments()

    libraries_response, libraries_status = get_libraries()
    if libraries_response is None or libraries_status != 200:
        print('Unable to fetch document libraries', file=sys.stderr)
        return

    libraries = libraries_response['data']
    if args.library == 'libraries':
        print_libraries(libraries)
        return

    match = next((lib for lib in libraries if lib['id'] == args.library), None)
    if match is None:
        print(f'Library {args.library} not found', file=sys.stderr)
        return

    query_response, query_status = query(match['id'], args.url, args.nresults)

    if query_response is None:
        print_query_error('Connection failed')
    elif query_status != 200:
        print_query_error(query_response['response'])
    else:
        print_query_success(query_response['response'], args.summary)

if __name__ == '__main__':
    main()

Let’s save our script as kquery.py. To list the available libraries, run python3 ./kquery.py libraries.

To run a query, pass the library id and a URL: python3 ./kquery.py enwiki -u https://en.wikipedia.org/wiki/Merge_sort -n 3.

If you don’t want to see the summaries, you can suppress them with the --no-summary flag.
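
Both summary flags write to the same summary destination, so the last one given on the command line wins, and summaries are shown by default. This can be checked without touching the network by handing parse_args an explicit argument list. In the sketch below, build_parser is a hypothetical cut-down copy of the parser in kquery.py, kept only for illustration.

```python
import argparse


def build_parser():
    """Recreate only the library argument and the two summary flags."""
    parser = argparse.ArgumentParser(description='KvasirA query tool.')
    parser.add_argument('library')
    parser.add_argument('-s', '--summary', dest='summary',
                        action='store_true')
    parser.add_argument('--no-summary', dest='summary',
                        action='store_false')
    # The parser-level default takes precedence over the per-flag defaults.
    parser.set_defaults(summary=True)
    return parser


parser = build_parser()
print(parser.parse_args(['enwiki']).summary)                  # True
print(parser.parse_args(['enwiki', '--no-summary']).summary)  # False
```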

If you have a great use case for KvasirA in mind and need help integrating it into your own app, contact us at contact@kvasira.com!