Utils

chaininglib.utils.dfops.check_valid_df(function_name, obj)[source]

This function is called by other functions to check that an input is a DataFrame where one is expected. If the input is not a DataFrame, an error is raised.

Parameters:
  • function_name – the name of the calling function, so that the error message can show where the error occurred
  • obj – the object to be checked
Returns:

N/A
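A minimal sketch of what such a guard might look like. The name check_valid_df_sketch, the exception type and the message wording are assumptions made for illustration; the library's actual behaviour may differ.

```python
import pandas as pd


def check_valid_df_sketch(function_name, obj):
    """Hypothetical stand-in for check_valid_df: raise if obj is not a DataFrame."""
    if not isinstance(obj, pd.DataFrame):
        # ValueError is an assumption; the real function may raise something else.
        raise ValueError(
            f"{function_name}() requires a Pandas DataFrame, "
            f"got {type(obj).__name__}"
        )


# A real DataFrame passes silently.
check_valid_df_sketch("property_freq", pd.DataFrame({"lemma": ["huis"]}))

# Anything else raises, and the message names the calling function.
try:
    check_valid_df_sketch("property_freq", ["not", "a", "dataframe"])
except ValueError as err:
    message = str(err)
```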

chaininglib.utils.dfops.column_difference(df_column1, df_column2)[source]

This function computes the differences and similarities between two Pandas DataFrame columns

Parameters:
  • df_column1 – a Pandas DataFrame, filtered by one column
  • df_column2 – a Pandas DataFrame, filtered by one column
Returns:

  • diff_left – array of words only in df_column1
  • diff_right – array of words only in df_column2
  • intersec – array of words both in df_column1 and df_column2

>>> diff_left, diff_right, intersec = column_difference(df_corpus1["word 1"], df_corpus2["word 1"])
>>> display( 'These words are only in DataFrame #1 : ' + ", ".join(diff_left) )
>>> display( 'These words are only in DataFrame #2 : ' + ", ".join(diff_right) )
>>> display( 'These words are common to both DataFrames : ' + ", ".join(intersec) )
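The three return values correspond to ordinary set operations. The sketch below illustrates that idea on toy data; column_difference_sketch is a hypothetical name, and sorting the results is an assumption made here only to keep the output predictable.

```python
import pandas as pd


def column_difference_sketch(df_column1, df_column2):
    """Hypothetical illustration of the left-only / right-only / shared split."""
    set1, set2 = set(df_column1), set(df_column2)
    diff_left = sorted(set1 - set2)   # only in the first column
    diff_right = sorted(set2 - set1)  # only in the second column
    intersec = sorted(set1 & set2)    # in both columns
    return diff_left, diff_right, intersec


df_corpus1 = pd.DataFrame({"word 1": ["appel", "peer", "kers"]})
df_corpus2 = pd.DataFrame({"word 1": ["peer", "kers", "druif"]})

diff_left, diff_right, intersec = column_difference_sketch(
    df_corpus1["word 1"], df_corpus2["word 1"]
)
```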
chaininglib.utils.dfops.df_filter(df_column, pattern, method='contains')[source]

Helper function to build some condition to filter a Pandas DataFrame, given a column and some value(s) to filter this column with

Parameters:
  • df_column – a Pandas DataFrame column to filter on
  • pattern – string, set or interval list to filter on
  • method – “contains” (default), “match”, “isin” or “interval”
Returns:

a condition

>>> words_ending_with_e = df_filter( df_lexicon["wordform"], 'e$' )
>>> df_lexicon_final_e = df_lexicon[ words_ending_with_e ]
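One plausible way to read the four methods is as thin wrappers around standard pandas operations. The sketch below makes that reading concrete; df_filter_sketch is a hypothetical name, and treating an "interval" pattern as an inclusive [low, high] pair is an assumption not confirmed by the documentation above.

```python
import pandas as pd


def df_filter_sketch(df_column, pattern, method="contains"):
    """Hypothetical illustration: build a boolean mask for a column."""
    if method == "contains":
        # Regex search anywhere in the string.
        return df_column.str.contains(pattern, regex=True, na=False)
    if method == "match":
        # Regex anchored at the start of the string.
        return df_column.str.match(pattern, na=False)
    if method == "isin":
        # Membership in a set or list of values.
        return df_column.isin(pattern)
    if method == "interval":
        # Assumption: pattern is [low, high], both bounds inclusive.
        low, high = pattern
        return df_column.between(low, high)
    raise ValueError(f"unknown method: {method}")


df_lexicon = pd.DataFrame({"wordform": ["kade", "hond", "bode", "kat"]})

words_ending_with_e = df_filter_sketch(df_lexicon["wordform"], "e$")
df_lexicon_final_e = df_lexicon[words_ending_with_e]

pets = df_filter_sketch(df_lexicon["wordform"], {"hond", "kat"}, method="isin")
```

The returned mask can be combined with other masks using `&` and `|` before indexing the DataFrame.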
chaininglib.utils.dfops.get_rank_diff(df1, df2, index=None, label1='rank_1', label2='rank_2')[source]

This function compares the rankings of words common to two dataframes and computes a rank_diff, so that one can see which words are very frequent in one set and rare in the other.

Parameters:
  • df1 – a Pandas DataFrame provided with rankings stored in a column “rank” (see example)
  • df2 – a Pandas DataFrame provided with rankings stored in a column “rank” (see example)
  • index (Optional) – name of the column to be used as index (usually: the lemmata column)
  • label1 (Optional) – output column name for the ranks of the items of df1
  • label2 (Optional) – output column name for the ranks of the items of df2
Returns:

a Pandas DataFrame with lemmata (index), ranks of both input dataframes (label1 and label2) and the rank_diff (‘rank_diff’ column).

>>> df_frequency_list1 = get_frequency_list(corpus_to_search1)
>>> df_frequency_list2 = get_frequency_list(corpus_to_search2)
>>> df_rankdiffs = get_rank_diff(df_frequency_list1, df_frequency_list2)
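To make the shape of the result concrete, here is a rough sketch on toy data. get_rank_diff_sketch is a hypothetical name, and the signed difference label1 minus label2 is an assumption; the library may compute rank_diff differently (for example as an absolute value).

```python
import pandas as pd


def get_rank_diff_sketch(df1, df2, index=None, label1="rank_1", label2="rank_2"):
    """Hypothetical illustration: align ranks of shared words and subtract."""
    if index is not None:
        df1 = df1.set_index(index)
        df2 = df2.set_index(index)
    # The inner join keeps only words present in both dataframes.
    ranks = pd.concat(
        [df1["rank"].rename(label1), df2["rank"].rename(label2)],
        axis=1,
        join="inner",
    )
    # Assumption: signed difference between the two ranks.
    ranks["rank_diff"] = ranks[label1] - ranks[label2]
    return ranks


df_a = pd.DataFrame({"lemma": ["de", "huis", "fiets"], "rank": [1, 2, 3]})
df_b = pd.DataFrame({"lemma": ["de", "fiets", "auto"], "rank": [1, 2, 3]})

df_rankdiffs = get_rank_diff_sketch(df_a, df_b, index="lemma")
```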
chaininglib.utils.dfops.get_relfreq_diff(df1, df2, index=None, label1='relfreq_1', label2='relfreq_2', operation='division', N=1)[source]

This function compares the relative frequencies of words common to two dataframes, by division or by subtraction, so that one can see which words are relatively frequent in one set and rare in the other.

Parameters:
  • df1 – a Pandas DataFrame provided with relative frequency stored in a column “perc” (see example)
  • df2 – a Pandas DataFrame provided with relative frequency stored in a column “perc” (see example)
  • index (Optional) – name of the column to be used as index (usually: the lemmata column)
  • label1 (Optional) – output column name for the relative frequency of the items of df1
  • label2 (Optional) – output column name for the relative frequency of the items of df2
  • operation (optional) – ‘division’ for dividing the relative frequencies by each other, ‘subtraction’ for subtracting the relative frequencies from each other. Default ‘division’
  • N (optional) – smoothing parameter when operation is ‘division’. Default 1.
Returns:

a Pandas DataFrame with lemmata (index), relative frequencies of both input dataframes (label1 and label2 columns) and a column holding the result of the chosen operation.

>>> df_frequency_list1 = get_frequency_list(corpus_to_search1)
>>> df_frequency_list2 = get_frequency_list(corpus_to_search2)
>>> df_relfreq_diffs = get_relfreq_diff(df_frequency_list1, df_frequency_list2)
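The two operations can be illustrated on a single pair of values. The documentation does not say how the smoothing parameter N enters the division, so the additive form below (N added to numerator and denominator) is purely an assumption, and relfreq_diff_sketch is a hypothetical name.

```python
def relfreq_diff_sketch(relfreq_1, relfreq_2, operation="division", N=1):
    """Hypothetical illustration of the two comparison modes."""
    if operation == "division":
        # Assumption: additive smoothing keeps the ratio finite and damps
        # the effect of very small relative frequencies.
        return (relfreq_1 + N) / (relfreq_2 + N)
    if operation == "subtraction":
        return relfreq_1 - relfreq_2
    raise ValueError(f"unknown operation: {operation}")


ratio = relfreq_diff_sketch(5.0, 1.0)                       # (5 + 1) / (1 + 1)
delta = relfreq_diff_sketch(5.0, 1.0, operation="subtraction")
```

Because the function uses only arithmetic operators, the same sketch also works element-wise on two aligned pandas Series.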
chaininglib.utils.dfops.join_df(df_arr, join_type=None)[source]

This function joins the dataframes in an array (i.e. concatenates them along axis 1)

Parameters:
  • df_arr – array of Pandas DataFrames
  • join_type – {inner, outer (default)}
Returns:

a single Pandas DataFrame

>>> new_df = join_df( [dataframe1, dataframe2] )
>>> display_df(new_df)
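Since the description equates the join with concatenation along axis 1, the difference between the two join types can be shown directly with pandas.concat. join_df_sketch is a hypothetical name, and mapping join_type=None to "outer" is an assumption based on the documented default.

```python
import pandas as pd


def join_df_sketch(df_arr, join_type=None):
    """Hypothetical illustration: column-wise concatenation of dataframes."""
    # Assumption: None falls back to the documented default, "outer".
    return pd.concat(df_arr, axis=1, join=join_type or "outer")


dataframe1 = pd.DataFrame({"freq_a": [10, 20]}, index=["huis", "fiets"])
dataframe2 = pd.DataFrame({"freq_b": [5, 7]}, index=["fiets", "auto"])

outer_df = join_df_sketch([dataframe1, dataframe2])           # union of indexes
inner_df = join_df_sketch([dataframe1, dataframe2], "inner")  # shared index only
```

With an outer join, rows missing from one input are filled with NaN; with an inner join, only rows whose index occurs in every input survive.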
chaininglib.utils.dfops.property_freq(df, column_name)[source]

Count values for a certain property in a results DataFrame, and sort them by frequency

Parameters:
  • df – DataFrame with results, one row per found token
  • column_name – Column name (property) to count
Returns:

a DataFrame of the most frequent values for this property, sorted by frequency. Column ‘token count’ contains the number of tokens, column ‘perc’ gives the percentage.
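A rough sketch of the described output, built on pandas value_counts. property_freq_sketch is a hypothetical name, and expressing ‘perc’ on a 0 to 100 scale is an assumption; the library might use fractions instead.

```python
import pandas as pd


def property_freq_sketch(df, column_name):
    """Hypothetical illustration: count values and add a percentage column."""
    # value_counts sorts by frequency, most frequent first.
    result = df[column_name].value_counts().to_frame(name="token count")
    total = result["token count"].sum()
    # Assumption: percentage on a 0-100 scale.
    result["perc"] = result["token count"] / total * 100
    return result


df_results = pd.DataFrame({"pos": ["NOUN", "VERB", "NOUN", "NOUN"]})
df_pos_freq = property_freq_sketch(df_results, "pos")
```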

chaininglib.utils.stringutils.containsRegex(word)[source]

This function checks whether some string contains a regular expression or not

Parameters:
  • word – a string to check for regular expressions
Returns:

a boolean
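The documentation does not state how a regular expression is detected. One simple heuristic, shown here only as an assumption, is to look for any regex metacharacter in the string; contains_regex_sketch is a hypothetical name.

```python
import re

# Assumption: a string "contains a regular expression" if it holds at least
# one of these metacharacters. The real check may be stricter or looser.
_REGEX_METACHARS = re.compile(r"[.^$*+?{}\[\]\\|()]")


def contains_regex_sketch(word):
    """Hypothetical illustration: True if word holds a regex metacharacter."""
    return _REGEX_METACHARS.search(word) is not None


plain = contains_regex_sketch("huis")
anchored = contains_regex_sketch("e$")
wildcard = contains_regex_sketch("h.*s")
```

A heuristic like this flags literal punctuation such as a full stop as well, so it suits short word-level queries better than free text.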