Process

chaininglib.process.corpus.extract_lexicon(dfs_corpus, lemmaColumnName='lemma', posColumnName='pos', wordformColumnName='word')

This function creates a lexicon from a list of corpus search results. The lemma, part-of-speech and word form column names of the corpus results are reused as the column names of the resulting lexicon.

Parameters:
  • dfs_corpus – list of Pandas DataFrames with search results from different corpora
  • lemmaColumnName – (default ‘lemma’) column name for lemma in dfs_corpus
  • posColumnName – (default ‘pos’) column name for part-of-speech in dfs_corpus
  • wordformColumnName – (default ‘word’) column name for word form in dfs_corpus
Returns:

a Pandas DataFrame representing a lexicon, with lemmaColumnName, posColumnName and wordformColumnName as columns

>>> dfs_corpus = [df_results_corpus1, df_results_corpus2]
>>> lexicon = extract_lexicon(dfs_corpus, lemmaColumnName='lemma', posColumnName='pos', wordformColumnName='word')
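The idea behind building a lexicon from several result sets can be sketched without chaininglib or pandas. The sketch below is only an illustration under stated assumptions: plain dictionaries stand in for DataFrame rows, the helper name `extract_lexicon_sketch` is hypothetical, and dropping duplicate entries is an assumption about the behaviour, not something the documentation above confirms.

```python
def extract_lexicon_sketch(corpora, lemma_col="lemma", pos_col="pos", word_col="word"):
    """Collect (lemma, pos, word form) entries from several result sets.

    Hypothetical helper: each corpus is a list of dicts standing in for
    DataFrame rows. Removing duplicates is an assumption of this sketch.
    """
    seen = set()
    lexicon = []
    for rows in corpora:
        for row in rows:
            entry = (row[lemma_col], row[pos_col], row[word_col])
            if entry not in seen:
                seen.add(entry)
                # Reuse the corpus column names as keys of the lexicon entry.
                lexicon.append({lemma_col: entry[0], pos_col: entry[1], word_col: entry[2]})
    return lexicon


# Two small, made-up result sets that share one entry.
corpus1 = [
    {"lemma": "boef", "pos": "NOUN", "word": "boeven"},
    {"lemma": "boef", "pos": "NOUN", "word": "boef"},
]
corpus2 = [
    {"lemma": "boef", "pos": "NOUN", "word": "boeven"},
    {"lemma": "lopen", "pos": "VERB", "word": "liep"},
]
lexicon = extract_lexicon_sketch([corpus1, corpus2])
for entry in lexicon:
    print(entry)
```

The column-name parameters mirror those of extract_lexicon, so corpus results with differently named columns can be passed in the same way.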

chaininglib.process.corpus.get_frequency_list(df_corpus, column_name='lemma')

This function computes the raw frequency of lemmata in a DataFrame containing corpus data.

Parameters:
  • df_corpus – a Pandas DataFrame with corpus data (it must contain at least one ‘lemma’ column)
  • column_name – the column name (default ‘lemma’) containing the items of which we are computing frequencies
Returns:

a Pandas DataFrame with ‘lemmata’ as index, ‘token count’ as the number of occurrences per lemma, and ‘rank’ as the ordinal position in the list of lemmata, ordered by ‘token count’.

>>> df_corpus = create_corpus("gysseling").lemma("boef").search().kwic()
>>> df_freq_list = get_frequency_list(df_corpus)
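What a raw frequency list with ranks contains can be shown with the standard library alone. This is a minimal sketch, not the library's implementation: `frequency_list_sketch` is a hypothetical name, dictionaries stand in for DataFrame rows, and giving tied items consecutive ranks in first-seen order is an assumption of the sketch.

```python
from collections import Counter


def frequency_list_sketch(rows, column_name="lemma"):
    """Count how often each value of `column_name` occurs and rank the values.

    Hypothetical helper: `rows` is a list of dicts standing in for DataFrame
    rows. Rank 1 is the most frequent item; ties receive consecutive ranks
    in first-seen order (an assumption of this sketch).
    """
    counts = Counter(row[column_name] for row in rows)
    ranked = counts.most_common()  # highest count first
    return {
        item: {"token count": count, "rank": rank}
        for rank, (item, count) in enumerate(ranked, start=1)
    }


# Made-up corpus rows; only the counted column is needed here.
rows = [
    {"lemma": "boef"},
    {"lemma": "dief"},
    {"lemma": "boef"},
    {"lemma": "boef"},
    {"lemma": "dief"},
    {"lemma": "schelm"},
]
freq = frequency_list_sketch(rows)
for lemma, stats in freq.items():
    print(lemma, stats)
```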

chaininglib.process.corpus.get_tagger(dfs_corpus, word_key='word', pos_key='universal_dependency')

This function instantiates a tagger trained on the corpus annotations found in one or more DataFrames.

Parameters:
  • dfs_corpus – one (or a list of) Pandas DataFrame(s) with annotated corpus data
  • word_key – (default ‘word’) column name for wordforms in dfs_corpus
  • pos_key – (default ‘universal_dependency’) column name for parts-of-speech in dfs_corpus
Returns:

a PerceptronTagger instance

>>> # get a tagger, trained with df_corpus: a Pandas DataFrame with lots of corpus data
>>> tagger = get_tagger(df_corpus)
>>> # tag a sentence now
>>> sentence = 'Here is some beautiful sentence'
>>> tagged_sentence = tagger.tag( sentence.split() )
>>> print(tagged_sentence)

chaininglib.process.lexicon.get_diamant_synonyms(df)

This function returns a set of synonyms for a lemma from a DiaMaNT result DataFrame. This is done by taking the definition text for entries which have been found by word form, and by taking the lemma for entries which have been found by definition text.

Parameters:
  • df – a Pandas DataFrame containing DiaMaNT data
Returns:

a set of synonyms

>>> lq = create_lexicon(lexicon).word(search_word).search()
>>> df_lexicon = lq.kwic()
>>> syns = get_diamant_synonyms(df_lexicon)
>>> display('Synonyms for ' + search_word + ': ' + ", ".join(syns))
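The two-way lookup described for get_diamant_synonyms can be illustrated with a small stand-alone sketch. Everything specific here is an assumption made for illustration: the column names `lemma`, `wordform` and `definition` are hypothetical, dictionaries stand in for DataFrame rows, and the sketch takes the search word as an explicit argument, whereas the documented function receives only the result DataFrame.

```python
def synonyms_sketch(rows, search_word):
    """Gather synonyms for `search_word` from lexicon rows.

    Hypothetical helper with hypothetical column names:
    - a row found by word form contributes its definition text;
    - a row found by definition text contributes its lemma.
    """
    synonyms = set()
    for row in rows:
        if row["wordform"] == search_word:
            synonyms.add(row["definition"])
        elif row["definition"] == search_word:
            synonyms.add(row["lemma"])
    return synonyms


# Made-up rows: the first matches by word form, the second by definition text,
# the third does not match at all.
rows = [
    {"lemma": "boef", "wordform": "boef", "definition": "schurk"},
    {"lemma": "schelm", "wordform": "schelm", "definition": "boef"},
    {"lemma": "huis", "wordform": "huis", "definition": "woning"},
]
syns = synonyms_sketch(rows, "boef")
print("Synonyms for boef: " + ", ".join(sorted(syns)))
```

Because the result is a set, sorting it before joining keeps the printed order stable.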