Corpus search

class chaininglib.search.CorpusQuery.CorpusQuery(resource, pattern=None, lemma=None, word=None, pos=None, detailed_context=False, extra_fields_doc=[], extra_fields_token=[], start_position=0, max_results=9223372036854775807, metadata_filter={}, method=None)[source]

A query on a token-based corpus.

detailed_context(detailed_context=True)[source]

Request a CorpusQuery object to return a detailed context.

Parameters:detailed_context – If True, every token is returned with multiple information layers (such as lemma, word form, part of speech, …). If False, only the hits carry multiple information layers.
Returns:CorpusQuery object
extra_fields_doc(extra_fields_doc)[source]

Request a CorpusQuery object to return the named document metadata fields.

Parameters:extra_fields_doc – List of extra document metadata fields
Returns:CorpusQuery object
extra_fields_token(extra_fields_token)[source]

Request a CorpusQuery object to return the named extra token layers.

Parameters:extra_fields_token – List of extra token layers
Returns:CorpusQuery object
kwic()[source]

Get the Pandas DataFrame with one keyword in context (KWIC) per row

Returns:Pandas DataFrame
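
As an illustration of the keyword-in-context layout, the sketch below builds one row per hit from a plain list of tokens. It is not chaininglib's implementation: the function name `kwic_rows`, the `window` parameter, and the column names are assumptions made for this example, and the DataFrame returned by `kwic()` may use different columns.

```python
def kwic_rows(tokens, keyword, window=2):
    """Return one dict per occurrence of `keyword`, with its left and right context."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Up to `window` tokens on each side of the hit.
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            rows.append({
                "left context": " ".join(left),
                "keyword": token,
                "right context": " ".join(right),
            })
    return rows


sentence = "de man zag de hond en de kat".split()
for row in kwic_rows(sentence, "de"):
    print(row)
```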
max_results(max_results)[source]

Limit the maximum number of results returned.

Parameters:max_results – maximum number of results.
Returns:CorpusQuery object
metadata_filter(metadata_filter)[source]

Set the metadata fields on which to filter the result set. If the method is FCS, results are filtered after the request. For BlackLab, already-filtered results are requested from the server.

Parameters:metadata_filter – Dictionary of conditions. The key represents the column to be filtered. If the value is a string, the value will be matched exactly. If the value is a list, it will be interpreted as a numerical interval.
Returns:CorpusQuery object
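
The two kinds of condition can be sketched in plain Python as follows. This is a rough illustration of the semantics described above, not the library's code: the helper `matches`, the sample field names, and the choice of inclusive interval bounds given as a two-element list are all assumptions.

```python
def matches(record, metadata_filter):
    """Check one metadata record against a dictionary of conditions."""
    for field, condition in metadata_filter.items():
        value = record.get(field)
        if isinstance(condition, list):
            # Assumption: a list is [low, high], both bounds inclusive.
            low, high = condition
            try:
                number = float(value)
            except (TypeError, ValueError):
                return False  # missing or non-numeric value cannot fall in the interval
            if not (low <= number <= high):
                return False
        elif value != condition:
            # A string condition must match exactly.
            return False
    return True


# Hypothetical document metadata, as it might appear in a results table.
records = [
    {"title": "Brief A", "year": "1664"},
    {"title": "Brief B", "year": "1672"},
    {"title": "Brief C", "year": "1780"},
]
filtered = [r for r in records if matches(r, {"year": [1660, 1680]})]
```

Filtering locally like this corresponds to the FCS case, where the filter is applied after the request. With BlackLab the conditions would instead be sent along with the request.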
method(method)[source]

Set the method used to make the request.

Parameters:method – fcs (Federated Content Search) or blacklab
Returns:CorpusQuery object
search()[source]

Request results matching a corpus search query

Returns:CorpusQuery object
>>> # build a corpus search query
>>> corpus_obj = create_corpus(some_corpus).pattern(some_pattern)
>>> # get the results
>>> df = corpus_obj.search().kwic()
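
Calls can be chained as in the example above because each setter returns the query object. The toy class below sketches that pattern under stated assumptions: `ToyQuery` and its `settings` helper are hypothetical and only mimic the shape of the interface, and nothing here contacts a corpus server.

```python
class ToyQuery:
    """Toy stand-in showing how setters that return `self` allow chaining."""

    def __init__(self, resource):
        self._resource = resource
        self._pattern = None
        self._max_results = None
        self._searched = False

    def pattern(self, pattern):
        self._pattern = pattern
        return self  # returning self is what makes chaining possible

    def max_results(self, max_results):
        self._max_results = max_results
        return self

    def search(self):
        # A real implementation would send the request here.
        self._searched = True
        return self

    def settings(self):
        return {
            "resource": self._resource,
            "pattern": self._pattern,
            "max_results": self._max_results,
            "searched": self._searched,
        }


query = ToyQuery("some_corpus").pattern('[lemma="lopen"]').max_results(10).search()
print(query.settings())
```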
start_position(start_position)[source]

Request that a CorpusQuery object return the given page of the paginated results. Users will rarely need to set this option themselves. The search procedure uses it to retrieve the full result set, which may be spread across several pages.

Parameters:start_position – result page number to be requested.
Returns:CorpusQuery object
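
The way a search procedure might walk through result pages can be sketched as below. This is a generic pagination loop, not chaininglib's code: `fetch_all`, `fake_fetch_page`, the page size, and zero-based page numbering are assumptions for illustration.

```python
def fetch_all(fetch_page, max_results=None):
    """Collect hits page by page until a page comes back empty."""
    results = []
    page = 0  # assumption: pages are numbered from 0
    while True:
        hits = fetch_page(page)
        if not hits:
            break
        results.extend(hits)
        if max_results is not None and len(results) >= max_results:
            return results[:max_results]
        page += 1
    return results


# Stand-in for a server: 7 hits served 3 per page.
ALL_HITS = [f"hit-{n}" for n in range(7)]
PAGE_SIZE = 3


def fake_fetch_page(page):
    start = page * PAGE_SIZE
    return ALL_HITS[start:start + PAGE_SIZE]


everything = fetch_all(fake_fetch_page)
first_four = fetch_all(fake_fetch_page, max_results=4)
```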
xml()[source]

Get the XML response (unparsed) of a Corpus search

Returns:XML string
>>> corpus_obj = create_corpus(some_corpus).pattern(some_pattern)
>>> xml = corpus_obj.search().xml()
chaininglib.search.CorpusQuery.create_corpus(name)[source]

API constructor: create a CorpusQuery object for the named corpus.

Parameters:name – corpus name
Returns:CorpusQuery object
>>> corpus_obj = create_corpus(some_corpus).pattern(some_pattern)
>>> df = corpus_obj.search().kwic()
chaininglib.search.CorpusQuery.get_available_corpora(exclude=[])[source]

Return the list of available corpora.

Returns:list of corpus name strings
>>> # get list of corpora at our disposal and query each of them
>>> for one_corpus in get_available_corpora(exclude=["nederlab"]):
...     c = create_corpus(one_corpus).lemma("woordenboek").detailed_context(True).search()
...     df_corpus = c.kwic()
chaininglib.search.corpusQueries.corpus_query(lemma=None, word=None, pos=None)[source]

Build a query for retrieving occurrences of a given lemma, word form, or part of speech from a corpus.

Parameters:
  • lemma – a lemma to look for
  • word – word form to look for
  • pos – POS tag to look for
Returns:

a corpus query string

>>> lemma_query = corpus_query(lemma="lopen")
>>> df_corpus = create_corpus("zeebrieven").pattern(lemma_query).search().kwic()
>>> display(df_corpus)
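
To give an idea of what such a query string can look like, the sketch below assembles a single-token pattern in Corpus Query Language (CQL) syntax, which BlackLab accepts. It is an assumption-laden illustration rather than the actual body of `corpus_query`: the function name `build_token_query` is hypothetical, the attribute names `lemma`, `word`, and `pos` depend on the corpus, and the string the library really produces may differ.

```python
def build_token_query(lemma=None, word=None, pos=None):
    """Combine the given constraints into one CQL token pattern."""
    constraints = []
    if lemma is not None:
        constraints.append(f'lemma="{lemma}"')
    if word is not None:
        constraints.append(f'word="{word}"')
    if pos is not None:
        constraints.append(f'pos="{pos}"')
    # All constraints apply to the same token, so they are joined with "&".
    return "[" + " & ".join(constraints) + "]"


print(build_token_query(lemma="lopen"))
print(build_token_query(lemma="lopen", pos="VRB"))
```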