Utils

chaininglib.utils.dfops.check_valid_df(function_name, obj)

Called by other functions to check that an input is a Pandas DataFrame where one is expected. If the input is not a DataFrame, an error is raised.

Parameters:
function_name – the name of the calling function, so that the error can show where it occurred
obj – the object to be checked
Returns:	N/A
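The check described above can be illustrated with a minimal sketch. The name check_valid_df_sketch, the exception type and the message wording are assumptions made for this example; they are not taken from the chaininglib source.

```python
import pandas as pd


def check_valid_df_sketch(function_name, obj):
    """Hypothetical stand-in: raise an error if obj is not a pandas DataFrame."""
    if not isinstance(obj, pd.DataFrame):
        # Assumption: the real function may use another exception type or message.
        raise ValueError(
            f"{function_name}() requires a Pandas DataFrame as argument, "
            f"got {type(obj).__name__} instead"
        )


# A DataFrame passes silently.
check_valid_df_sketch("join_df", pd.DataFrame({"word": ["huis"]}))

# Anything else raises, and the message names the calling function.
try:
    check_valid_df_sketch("join_df", ["not", "a", "dataframe"])
    message = None
except ValueError as err:
    message = str(err)
```

Passing the caller's name in explicitly keeps the helper simple: the error text can point at the user-facing function rather than at the helper itself.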

chaininglib.utils.dfops.column_difference(df_column1, df_column2)

This function computes the differences and the overlap between two Pandas DataFrame columns

Parameters:
df_column1 – a Pandas DataFrame, filtered by one column
df_column2 – a Pandas DataFrame, filtered by one column
Returns:
diff_left – array of words only in df_column1
diff_right – array of words only in df_column2
intersec – array of words both in df_column1 and df_column2

>>> diff_left, diff_right, intersec = column_difference(df_corpus1["word 1"], df_corpus2["word 1"])
>>> display( 'These words are only in DataFrame #1 : ' + ", ".join(diff_left) )
>>> display( 'These words are only in DataFrame #2 : ' + ", ".join(diff_right) )
>>> display( 'These words are common to both DataFrame : ' + ", ".join(intersec) )
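The three return values correspond to ordinary set operations. The sketch below shows one way to compute them; column_difference_sketch is a hypothetical name, and sorting the results is an assumption added so that the output is deterministic.

```python
import pandas as pd


def column_difference_sketch(df_column1, df_column2):
    """Hypothetical stand-in built on Python set operations."""
    set1 = set(df_column1)
    set2 = set(df_column2)
    diff_left = sorted(set1 - set2)    # only in the first column
    diff_right = sorted(set2 - set1)   # only in the second column
    intersec = sorted(set1 & set2)     # in both columns
    return diff_left, diff_right, intersec


col1 = pd.Series(["appel", "boom", "huis"])
col2 = pd.Series(["boom", "fiets", "huis"])
diff_left, diff_right, intersec = column_difference_sketch(col1, col2)
```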

chaininglib.utils.dfops.df_filter(df_column, pattern, method='contains')

Helper function that builds a condition for filtering a Pandas DataFrame, given a column and the value(s) to filter this column with

Parameters:
df_column – a Pandas DataFrame column to filter on
pattern – string, set or interval list to filter on
method – “contains”, “match”, “isin” or “interval”
Returns:	a condition

>>> words_ending_with_e = df_filter( df_lexicon["wordform"], 'e$' )
>>> df_lexicon_final_e = df_lexicon[ words_ending_with_e ]
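The four methods plausibly map onto standard pandas operations that each return a boolean mask. The sketch below works under that assumption; df_filter_sketch is a hypothetical name, and treating an interval as an inclusive [low, high] pair is a guess about the intended behaviour, not something the documentation states.

```python
import pandas as pd


def df_filter_sketch(df_column, pattern, method="contains"):
    """Hypothetical stand-in returning a boolean mask for df_column."""
    if method == "contains":
        # Regular expression found anywhere in the string.
        return df_column.str.contains(pattern, regex=True, na=False)
    if method == "match":
        # Regular expression anchored at the start of the string.
        return df_column.str.match(pattern, na=False)
    if method == "isin":
        # Membership in a collection of allowed values.
        return df_column.isin(list(pattern))
    if method == "interval":
        # Assumption: pattern is a [low, high] pair, both ends inclusive.
        low, high = pattern
        return df_column.between(low, high)
    raise ValueError(f"unknown method: {method!r}")


df_lexicon = pd.DataFrame(
    {"wordform": ["appel", "vrede", "huis", "zee"],
     "year": [1650, 1700, 1800, 1900]}
)
ends_with_e = df_filter_sketch(df_lexicon["wordform"], "e$")
df_final_e = df_lexicon[ends_with_e]
in_set = df_filter_sketch(df_lexicon["wordform"], {"huis", "zee"}, method="isin")
in_range = df_filter_sketch(df_lexicon["year"], [1700, 1800], method="interval")
```

Because the result is a plain boolean mask, several conditions can be combined with `&` and `|` before indexing the DataFrame.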

chaininglib.utils.dfops.get_rank_diff(df1, df2, index=None, label1='rank_1', label2='rank_2')

This function compares the rankings of words common to two dataframes and computes a rank_diff, so that one can see which words are very frequent in one set and rare in the other.

Parameters:
df1 – a Pandas DataFrame provided with rankings stored in a column “rank” (see example)
df2 – a Pandas DataFrame provided with rankings stored in a column “rank” (see example)
index (Optional) – name of the column to be used as index (usually: the lemmata column)
label1 (Optional) – output column name for the ranks of the items of df1
label2 (Optional) – output column name for the ranks of the items of df2
Returns:	a Pandas DataFrame with lemmata (index), ranks of both input dataframes (label1 and label2) and the rank_diff (‘rank_diff’ column).

>>> df_frequency_list1 = get_frequency_list(corpus_to_search1)
>>> df_frequency_list2 = get_frequency_list(corpus_to_search2)
>>> df_rankdiffs = get_rank_diff(df_frequency_list1, df_frequency_list2)
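To make the comparison concrete, here is a minimal sketch of a rank comparison restricted to the lemmata both dataframes share. get_rank_diff_sketch is a hypothetical name, and computing rank_diff as the signed difference label1 minus label2 is an assumption; the library may use another convention, such as an absolute difference.

```python
import pandas as pd


def get_rank_diff_sketch(df1, df2, index=None, label1="rank_1", label2="rank_2"):
    """Hypothetical stand-in comparing the 'rank' column of two frames."""
    if index is not None:
        df1 = df1.set_index(index)
        df2 = df2.set_index(index)
    # Ranks can only be compared for items present in both frames.
    common = df1.index.intersection(df2.index)
    result = pd.DataFrame(
        {label1: df1.loc[common, "rank"], label2: df2.loc[common, "rank"]}
    )
    # Assumption: signed difference; the sign convention is illustrative only.
    result["rank_diff"] = result[label1] - result[label2]
    return result


df1 = pd.DataFrame({"lemma": ["de", "huis", "zee"], "rank": [1, 2, 3]})
df2 = pd.DataFrame({"lemma": ["de", "zee", "fiets"], "rank": [1, 2, 3]})
df_rankdiffs = get_rank_diff_sketch(df1, df2, index="lemma")
```

In this toy data only "de" and "zee" occur in both frames, so the result has two rows; "zee" is ranked third in the first frame and second in the other.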

chaininglib.utils.dfops.get_relfreq_diff(df1, df2, index=None, label1='relfreq_1', label2='relfreq_2', operation='division', N=1)

This function compares the relative frequencies of words common to two dataframes, either by dividing or by subtracting them, so that one can see which words are very frequent in one set and rare in the other.

Parameters:
df1 – a Pandas DataFrame provided with relative frequency stored in a column “perc” (see example)
df2 – a Pandas DataFrame provided with relative frequency stored in a column “perc” (see example)
index (Optional) – name of the column to be used as index (usually: the lemmata column)
label1 (Optional) – output column name for the relative frequency of the items of df1
label2 (Optional) – output column name for the relative frequency of the items of df2
operation (Optional) – ‘division’ for dividing the relative frequencies by each other, ‘subtraction’ for subtracting them from each other. Default ‘division’
N (Optional) – smoothing parameter when operation is ‘division’. Default 1.
Returns:	a Pandas DataFrame with lemmata (index), the relative frequencies of both input dataframes (label1 and label2 columns) and the result of the chosen operation.

>>> df_frequency_list1 = get_frequency_list(corpus_to_search1)
>>> df_frequency_list2 = get_frequency_list(corpus_to_search2)
>>> df_relfreqdiffs = get_relfreq_diff(df_frequency_list1, df_frequency_list2)
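The two operations and the role of the smoothing parameter can be sketched as follows. Everything specific here is an assumption made for illustration: the name get_relfreq_diff_sketch, the output column name ‘relfreq_diff’, and the smoothing formula (adding N to both numerator and denominator), none of which the documentation above specifies.

```python
import pandas as pd


def get_relfreq_diff_sketch(df1, df2, index=None, label1="relfreq_1",
                            label2="relfreq_2", operation="division", N=1):
    """Hypothetical stand-in comparing the 'perc' column of two frames."""
    if index is not None:
        df1 = df1.set_index(index)
        df2 = df2.set_index(index)
    common = df1.index.intersection(df2.index)
    result = pd.DataFrame(
        {label1: df1.loc[common, "perc"], label2: df2.loc[common, "perc"]}
    )
    if operation == "division":
        # Assumed add-N smoothing, which also avoids division by zero.
        result["relfreq_diff"] = (result[label1] + N) / (result[label2] + N)
    elif operation == "subtraction":
        result["relfreq_diff"] = result[label1] - result[label2]
    else:
        raise ValueError(f"unknown operation: {operation!r}")
    return result


df1 = pd.DataFrame({"lemma": ["de", "zee"], "perc": [5.0, 3.0]})
df2 = pd.DataFrame({"lemma": ["de", "zee"], "perc": [5.0, 1.0]})
ratio = get_relfreq_diff_sketch(df1, df2, index="lemma")
delta = get_relfreq_diff_sketch(df1, df2, index="lemma", operation="subtraction")
```

A ratio near 1 (or a difference near 0) marks a word used about equally often in both sets; values far from that mark words typical of one set.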

chaininglib.utils.dfops.join_df(df_arr, join_type=None)

This function joins dataframes side by side (= concat along axis 1)

Parameters:
df_arr – array of Pandas DataFrames
join_type – {inner, outer (default)}
Returns:	a single Pandas DataFrame

>>> new_df = join_df( [dataframe1, dataframe2] )
>>> display_df(new_df)
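Since the description equates the join with a concat along axis 1, the behaviour can be sketched with pandas.concat directly. join_df_sketch is a hypothetical name; mapping join_type=None to an outer join follows the documented default.

```python
import pandas as pd


def join_df_sketch(df_arr, join_type=None):
    """Hypothetical stand-in: concatenate the frames column-wise."""
    how = "outer" if join_type is None else join_type
    return pd.concat(df_arr, axis=1, join=how)


df_a = pd.DataFrame({"freq_a": [10, 4]}, index=["de", "huis"])
df_b = pd.DataFrame({"freq_b": [12, 7]}, index=["de", "zee"])

outer = join_df_sketch([df_a, df_b])            # union of the index labels
inner = join_df_sketch([df_a, df_b], "inner")   # labels present in both frames
```

With an outer join, labels missing from one frame get NaN in that frame's columns; an inner join keeps only the shared labels.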

chaininglib.utils.dfops.property_freq(df, column_name)

Count values for a certain property in a results DataFrame, and sort them by frequency

Parameters:
df – DataFrame with results, one row per found token
column_name – Column name (property) to count
Returns:	a DataFrame of the values for this property, sorted by frequency. Column ‘token count’ contains the number of tokens, column ‘perc’ gives the percentage.
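A count of this kind can be sketched with Series.value_counts, which already sorts by descending frequency. property_freq_sketch is a hypothetical name, and expressing ‘perc’ on a 0 to 100 scale is an assumption; the library might store a fraction instead.

```python
import pandas as pd


def property_freq_sketch(df, column_name):
    """Hypothetical stand-in counting the values of one column."""
    counts = df[column_name].value_counts()   # most frequent value first
    result = counts.to_frame("token count")
    # Assumption: percentage on a 0-100 scale.
    result["perc"] = counts / counts.sum() * 100
    return result


df_results = pd.DataFrame({"pos": ["NOUN", "VERB", "NOUN", "ADJ"]})
df_pos_freq = property_freq_sketch(df_results, "pos")
```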

chaininglib.utils.stringutils.containsRegex(word)

This function checks whether a string contains regular expression syntax

Parameters:	word – a string to check for regular expressions
Returns:	A boolean
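One simple way to approximate such a check is to look for regular expression metacharacters. The sketch below is a heuristic under that assumption; contains_regex_sketch and the chosen character set are hypothetical and may differ from what containsRegex actually tests.

```python
# Assumption: these characters are treated as signs of regex syntax.
REGEX_METACHARACTERS = set("^$.*+?[](){}|\\")


def contains_regex_sketch(word):
    """Hypothetical stand-in: True if word has any regex metacharacter."""
    return any(char in REGEX_METACHARACTERS for char in word)


plain = contains_regex_sketch("huis")
anchored = contains_regex_sketch("e$")
char_class = contains_regex_sketch("h[ou]is")
```

A heuristic like this flags any string that contains a full stop or a question mark, so it suits single word forms better than running text.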