Corpus search
-
class chaininglib.search.CorpusQuery.CorpusQuery(resource, pattern=None, lemma=None, word=None, pos=None, detailed_context=False, extra_fields_doc=[], extra_fields_token=[], start_position=0, max_results=9223372036854775807, metadata_filter={}, method=None)
A query on a token-based corpus.
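Each option in the constructor also has a method of the same name that returns a CorpusQuery object, which is what makes calls chainable. The following is a minimal sketch of that pattern only, not the library's implementation: `QuerySketch`, its `settings` dictionary, and the choice to return a modified copy rather than mutate in place are all assumptions made for illustration.

```python
import copy


class QuerySketch:
    """Hypothetical stand-in for a chainable query object."""

    def __init__(self, resource):
        self.resource = resource
        self.settings = {}

    def _with(self, key, value):
        # Return a modified copy so that the original query stays reusable.
        clone = copy.deepcopy(self)
        clone.settings[key] = value
        return clone

    def lemma(self, lemma):
        return self._with("lemma", lemma)

    def detailed_context(self, detailed_context=True):
        return self._with("detailed_context", detailed_context)

    def max_results(self, max_results):
        return self._with("max_results", max_results)


base = QuerySketch("zeebrieven")
query = base.lemma("lopen").detailed_context().max_results(10)
print(query.settings)
```

Because every setter returns a query object, options can be added in any order, and in this sketch `base` remains unchanged after the chain.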
-
detailed_context(detailed_context=True)
Request a CorpusQuery object to return a detailed context.
Parameters: detailed_context – If True, every token is returned with multiple information layers (such as lemma, word form, part of speech, …). If False, only hits have multiple information layers.
Returns: CorpusQuery object
-
extra_fields_doc(extra_fields_doc)
Request a CorpusQuery object to return the named document metadata fields.
Parameters: extra_fields_doc – List of extra document metadata fields
Returns: CorpusQuery object
-
extra_fields_token(extra_fields_token)
Request a CorpusQuery object to return the named extra token layers.
Parameters: extra_fields_token – List of extra token layers
Returns: CorpusQuery object
-
kwic()
Get a pandas DataFrame with one keyword in context (KWIC) per row.
Returns: pandas DataFrame
-
max_results(max_results)
Limit the maximum number of results returned. The constructor default, 9223372036854775807 (2^63 − 1), is effectively unlimited.
Parameters: max_results – maximum number of results
Returns: CorpusQuery object
-
metadata_filter(metadata_filter)
Set the metadata fields to filter the result set on. If the method is FCS, results are filtered after the request. For BlackLab, filtered results are requested from the server.
Parameters: metadata_filter – Dictionary of conditions. Each key names the column to be filtered. If the value is a string, it is matched exactly. If the value is a list, it is interpreted as a numerical interval.
Returns: CorpusQuery object
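The filter semantics described above (string means exact match, list means numerical interval) can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the `matches` helper, the column names, and the sample rows are hypothetical, and the interval is assumed to be inclusive at both ends, which the documentation does not specify.

```python
def matches(row, metadata_filter):
    """Return True if a result row satisfies every condition in the filter."""
    for column, condition in metadata_filter.items():
        value = row.get(column)
        if value is None:
            # A row without the filtered column cannot satisfy the condition.
            return False
        if isinstance(condition, str):
            # String condition: the value must match exactly.
            if value != condition:
                return False
        else:
            # List condition: numerical interval [low, high], assumed inclusive.
            low, high = condition
            if not (low <= float(value) <= high):
                return False
    return True


# Hypothetical result rows with made-up metadata columns.
rows = [
    {"title": "Brief A", "year": "1664", "region": "Zeeland"},
    {"title": "Brief B", "year": "1672", "region": "Holland"},
    {"title": "Brief C", "year": "1780", "region": "Zeeland"},
]

conditions = {"region": "Zeeland", "year": [1600, 1700]}
selected = [row["title"] for row in rows if matches(row, conditions)]
print(selected)  # → ['Brief A']
```

With FCS this kind of filtering would happen on the client after the results arrive, whereas with BlackLab the conditions are sent along with the request.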
-
method(method)
Set the method used to make the request.
Parameters: method – fcs (Federated Content Search) or blacklab
Returns: CorpusQuery object
-
search()
Request results matching a corpus search query.
Returns: CorpusQuery object
>>> # build a corpus search query
>>> corpus_obj = create_corpus(some_corpus).pattern(some_pattern)
>>> # get the results
>>> df = corpus_obj.search().kwic()
-
start_position(start_position)
Request a CorpusQuery object to return the given page of the paged result set. Users will rarely need to set this option themselves, but the search procedure relies on it to retrieve complete results, which may be spread over several pages.
Parameters: start_position – number of the result page to request
Returns: CorpusQuery object
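To illustrate why a search procedure needs a page number, here is a minimal sketch of collecting results that are spread over several pages. Everything in it is hypothetical: `fetch_all`, `fake_fetch_page`, and the page size stand in for whatever the real search procedure and server do, and stopping at the first short page is only one possible termination rule.

```python
def fetch_all(fetch_page, page_size):
    """Collect results page by page until a short or empty page arrives."""
    results = []
    page = 0
    while True:
        hits = fetch_page(page)
        results.extend(hits)
        if len(hits) < page_size:
            # A page that is not full means there is nothing left to fetch.
            break
        page += 1
    return results


# Stand-in for a server: 7 hits served in pages of 3.
ALL_HITS = [f"hit-{i}" for i in range(7)]
PAGE_SIZE = 3


def fake_fetch_page(page):
    start = page * PAGE_SIZE
    return ALL_HITS[start:start + PAGE_SIZE]


collected = fetch_all(fake_fetch_page, PAGE_SIZE)
print(len(collected))  # → 7
```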
-
-
chaininglib.search.CorpusQuery.create_corpus(name)
API constructor.
Parameters: name – corpus name
Returns: CorpusQuery object
>>> corpus_obj = create_corpus(some_corpus).pattern(some_pattern)
>>> df = corpus_obj.search().kwic()
-
chaininglib.search.CorpusQuery.get_available_corpora(exclude=[])
Return the list of available corpora.
Parameters: exclude – list of corpus names to leave out of the result
Returns: list of corpus name strings
>>> # get list of corpora at our disposal and query each of them
>>> for one_corpus in get_available_corpora(exclude=["nederlab"]):
...     c = create_corpus(one_corpus).lemma("woordenboek").detailed_context(True).search()
...     df_corpus = c.kwic()
-
chaininglib.search.corpusQueries.corpus_query(lemma=None, word=None, pos=None)
Build a query for retrieving occurrences of a given lemma, word form, or part-of-speech tag within a corpus.
Parameters: - lemma – a lemma to look for
- word – word form to look for
- pos – POS tag to look for
Returns: a corpus query string
>>> lemma_query = corpus_query(lemma="lopen")
>>> df_corpus = create_corpus("zeebrieven").pattern(lemma_query).search().kwic()
>>> display(df_corpus)
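The exact string that corpus_query returns is not shown here. As a rough sketch of what such a builder might do, the hypothetical helper below combines the given constraints into a single-token pattern in Corpus Query Language (CQL) style. The attribute names `lemma`, `word` and `pos`, the POS tag in the usage example, and the output format are assumptions and may differ from what the library or a given corpus uses.

```python
def build_token_pattern(lemma=None, word=None, pos=None):
    """Combine the given constraints into one CQL-style token pattern."""
    constraints = []
    for attribute, value in (("lemma", lemma), ("word", word), ("pos", pos)):
        if value is not None:
            constraints.append(f'{attribute}="{value}"')
    # Constraints on the same token are joined with "&"; an empty "[]"
    # matches any token in CQL.
    return "[" + " & ".join(constraints) + "]"


print(build_token_pattern(lemma="lopen"))             # → [lemma="lopen"]
print(build_token_pattern(lemma="lopen", pos="VRB"))  # → [lemma="lopen" & pos="VRB"]
```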