Utils
-
chaininglib.utils.dfops.check_valid_df(function_name, obj)
Other functions call this helper to check that an input is a DataFrame when one is expected. If the input is not a DataFrame, an error is raised.
Parameters: - function_name – the name of the calling function, used to show where an error occurred
- obj – the object to be checked
Returns: N/A
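The check described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the name `check_valid_df_sketch` is hypothetical, and both the exception type (`ValueError`) and the message wording are assumptions, since the documentation only says that an error is thrown.

```python
import pandas as pd


def check_valid_df_sketch(function_name, obj):
    """Hypothetical stand-in: raise if obj is not a pandas DataFrame."""
    if not isinstance(obj, pd.DataFrame):
        # Assumption: ValueError; the real function may raise something else.
        raise ValueError(
            f"{function_name}() requires a pandas DataFrame, "
            f"got {type(obj).__name__}"
        )


# A DataFrame passes silently.
check_valid_df_sketch("property_freq", pd.DataFrame({"lemma": ["huis"]}))

# Anything else raises, and the message names the calling function.
try:
    check_valid_df_sketch("property_freq", ["huis"])
except ValueError as err:
    error_message = str(err)
```

Passing the caller's name in explicitly keeps the helper free of stack inspection while still pointing the user at the function that received the bad input.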
-
chaininglib.utils.dfops.column_difference(df_column1, df_column2)
This function computes the differences and similarities between two Pandas DataFrame columns.
Parameters: - df_column1 – a Pandas DataFrame, filtered by one column
- df_column2 – a Pandas DataFrame, filtered by one column
Returns: - diff_left – array of words only in df_column1
- diff_right – array of words only in df_column2
- intersec – array of words in both df_column1 and df_column2
>>> diff_left, diff_right, intersec = column_difference(df_corpus1["word 1"], df_corpus2["word 1"])
>>> display( 'These words are only in DataFrame #1 : ' + ", ".join(diff_left) )
>>> display( 'These words are only in DataFrame #2 : ' + ", ".join(diff_right) )
>>> display( 'These words are common to both DataFrames : ' + ", ".join(intersec) )
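The three return values correspond to two set differences and an intersection. The sketch below shows that idea on toy data; `column_difference_sketch` is a hypothetical name, and sorting the results is an assumption made here only to get a stable order.

```python
import pandas as pd


def column_difference_sketch(df_column1, df_column2):
    """Hypothetical stand-in built on plain Python sets."""
    set1, set2 = set(df_column1), set(df_column2)
    diff_left = sorted(set1 - set2)   # only in the first column
    diff_right = sorted(set2 - set1)  # only in the second column
    intersec = sorted(set1 & set2)    # present in both columns
    return diff_left, diff_right, intersec


# Toy data, invented for illustration.
df_corpus1 = pd.DataFrame({"word 1": ["appel", "peer", "kers"]})
df_corpus2 = pd.DataFrame({"word 1": ["peer", "kers", "druif"]})

diff_left, diff_right, intersec = column_difference_sketch(
    df_corpus1["word 1"], df_corpus2["word 1"]
)
print("only in #1:", ", ".join(diff_left))
print("only in #2:", ", ".join(diff_right))
print("in both:", ", ".join(intersec))
```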
-
chaininglib.utils.dfops.df_filter(df_column, pattern, method='contains')
Helper function that builds a condition for filtering a Pandas DataFrame, given a column and the value(s) to filter that column with.
Parameters: - df_column – a Pandas DataFrame column to filter on
- pattern – string, set or interval list to filter on
- method – “contains”, “match”, “isin” or “interval”
Returns: a condition
>>> words_ending_with_e = df_filter( df_lexicon["wordform"], 'e$' )
>>> df_lexicon_final_e = df_lexicon[ words_ending_with_e ]
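A condition of this kind is a boolean Series that can be used to index the DataFrame. The sketch below maps the four method names onto standard pandas calls. The mapping is an assumption, not the library's code: in particular, treating “match” as a regex anchored at the start of the string and “interval” as an inclusive `[low, high]` pair is a guess. `df_filter_sketch` and the toy lexicon are hypothetical.

```python
import pandas as pd


def df_filter_sketch(df_column, pattern, method="contains"):
    """Hypothetical stand-in returning a boolean mask for df_column."""
    if method == "contains":
        # Regex search anywhere in the string.
        return df_column.str.contains(pattern, regex=True, na=False)
    if method == "match":
        # Assumption: regex anchored at the start of the string.
        return df_column.str.match(pattern, na=False)
    if method == "isin":
        # Membership in a set or list of allowed values.
        return df_column.isin(pattern)
    if method == "interval":
        # Assumption: pattern is [low, high], both bounds inclusive.
        low, high = pattern
        return df_column.between(low, high)
    raise ValueError(f"unknown method: {method!r}")


# Toy lexicon, invented for illustration.
df_lexicon = pd.DataFrame(
    {
        "wordform": ["huize", "huis", "boome", "boom"],
        "year": [1650, 1900, 1700, 1950],
    }
)

words_ending_with_e = df_filter_sketch(df_lexicon["wordform"], "e$")
df_lexicon_final_e = df_lexicon[words_ending_with_e]

modern_forms = df_filter_sketch(
    df_lexicon["wordform"], {"huis", "boom"}, method="isin"
)
early_rows = df_filter_sketch(df_lexicon["year"], [1600, 1800], method="interval")
```

Because each call returns a mask rather than a filtered frame, several conditions can be combined with `&` and `|` before indexing.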
-
chaininglib.utils.dfops.get_rank_diff(df1, df2, index=None, label1='rank_1', label2='rank_2')
This function compares the rankings of words common to two dataframes and computes a rank_diff, so that one can see which words are very frequent in one set and rare in the other.
Parameters: - df1 – a Pandas DataFrame provided with rankings stored in a column “rank” (see example)
- df2 – a Pandas DataFrame provided with rankings stored in a column “rank” (see example)
- index (Optional) – name of the column to be used as index (usually: the lemmata column)
- label1 (Optional) – output column name for the ranks of the items of df1
- label2 (Optional) – output column name for the ranks of the items of df2
Returns: a Pandas DataFrame with lemmata (index), ranks of both input dataframes (label1 and label2) and the rank_diff (‘rank_diff’ column).
>>> df_frequency_list1 = get_frequency_list(corpus_to_search1)
>>> df_frequency_list2 = get_frequency_list(corpus_to_search2)
>>> df_rankdiffs = get_rank_diff(df_frequency_list1, df_frequency_list2)
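To make the comparison concrete, the sketch below aligns the “rank” columns of two toy frequency lists on their shared words and derives a rank_diff column. It is an illustration under assumptions, not the library's implementation: `get_rank_diff_sketch` is a hypothetical name, and taking the absolute difference and sorting by it are choices made here, since the documentation does not state how rank_diff is computed.

```python
import pandas as pd


def get_rank_diff_sketch(df1, df2, index=None, label1="rank_1", label2="rank_2"):
    """Hypothetical stand-in: compare ranks of words present in both frames."""
    if index is not None:
        df1 = df1.set_index(index)
        df2 = df2.set_index(index)
    # The inner join keeps only the words common to both inputs.
    ranks = pd.concat(
        [df1["rank"].rename(label1), df2["rank"].rename(label2)],
        axis=1,
        join="inner",
    )
    # Assumption: absolute difference between the two ranks.
    ranks["rank_diff"] = (ranks[label1] - ranks[label2]).abs()
    return ranks.sort_values("rank_diff", ascending=False)


# Toy frequency lists, invented for illustration.
freq1 = pd.DataFrame({"lemma": ["de", "huis", "fiets", "zee"], "rank": [1, 2, 3, 4]})
freq2 = pd.DataFrame({"lemma": ["de", "zee", "huis", "boot"], "rank": [1, 2, 3, 4]})

df_rankdiffs = get_rank_diff_sketch(freq1, freq2, index="lemma")
print(df_rankdiffs)
```

Words that occur in only one list ("fiets" and "boot" here) drop out, which matches the statement that only words common to both dataframes are compared.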
-
chaininglib.utils.dfops.get_relfreq_diff(df1, df2, index=None, label1='relfreq_1', label2='relfreq_2', operation='division', N=1)
This function compares the relative frequencies of words common to two dataframes, by dividing or subtracting them, so that one can see which words are relatively frequent in one set and rare in the other.
Parameters: - df1 – a Pandas DataFrame provided with relative frequency stored in a column “perc” (see example)
- df2 – a Pandas DataFrame provided with relative frequency stored in a column “perc” (see example)
- index (Optional) – name of the column to be used as index (usually: the lemmata column)
- label1 (Optional) – output column name for the relative frequency of the items of df1
- label2 (Optional) – output column name for the relative frequency of the items of df2
- operation (optional) – ‘division’ for dividing the relative frequencies by each other, ‘subtraction’ for subtracting the relative frequencies from each other. Default ‘division’
- N (optional) – smoothing parameter when operation is ‘division’. Default 1.
Returns: a Pandas DataFrame with lemmata (index), the relative frequencies of both input dataframes (label1 and label2 columns) and the result of comparing them by division or subtraction.
>>> df_frequency_list1 = get_frequency_list(corpus_to_search1)
>>> df_frequency_list2 = get_frequency_list(corpus_to_search2)
>>> df_relfreq_diffs = get_relfreq_diff(df_frequency_list1, df_frequency_list2)
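The effect of the operation and N parameters can be illustrated on toy data. The sketch below is heavily assumption-laden: the documentation does not say how N enters the division, so add-N smoothing of both numerator and denominator is a guess. The function name `get_relfreq_diff_sketch` and the output column name `relfreq_diff` are hypothetical as well.

```python
import pandas as pd


def get_relfreq_diff_sketch(
    df1, df2, index=None, label1="relfreq_1", label2="relfreq_2",
    operation="division", N=1,
):
    """Hypothetical stand-in: compare the 'perc' columns of two frames."""
    if index is not None:
        df1 = df1.set_index(index)
        df2 = df2.set_index(index)
    # Keep only the words common to both inputs.
    freqs = pd.concat(
        [df1["perc"].rename(label1), df2["perc"].rename(label2)],
        axis=1,
        join="inner",
    )
    if operation == "division":
        # Assumption: N is added to both sides before dividing.
        freqs["relfreq_diff"] = (freqs[label1] + N) / (freqs[label2] + N)
    elif operation == "subtraction":
        freqs["relfreq_diff"] = freqs[label1] - freqs[label2]
    else:
        raise ValueError(f"unknown operation: {operation!r}")
    return freqs


# Toy relative frequencies, invented for illustration.
freq1 = pd.DataFrame({"lemma": ["de", "huis"], "perc": [5.0, 1.0]})
freq2 = pd.DataFrame({"lemma": ["de", "huis"], "perc": [5.0, 3.0]})

by_division = get_relfreq_diff_sketch(freq1, freq2, index="lemma")
by_subtraction = get_relfreq_diff_sketch(
    freq1, freq2, index="lemma", operation="subtraction"
)
```

Under this reading, a ratio near 1 means the word is about equally frequent in both sets, and the smoothing term keeps the ratio finite when a relative frequency is 0.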
-
chaininglib.utils.dfops.join_df(df_arr, join_type=None)
This function joins the given dataframes (= concat along axis 1).
Parameters: - df_arr – array of Pandas DataFrames
- join_type – {inner, outer (default)}
Returns: a single Pandas DataFrame
>>> new_df = join_df( [dataframe1, dataframe2] )
>>> display_df(new_df)
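Since the join is described as a concat along axis 1, the difference between the two join types can be shown directly with `pandas.concat`. In this sketch `join_df_sketch` is a hypothetical name, and falling back to “outer” when join_type is None is an assumption based on the documented default.

```python
import pandas as pd


def join_df_sketch(df_arr, join_type=None):
    """Hypothetical stand-in: concatenate frames side by side on their index."""
    # Assumption: None means the documented default, "outer".
    return pd.concat(df_arr, axis=1, join=join_type or "outer")


# Toy frames with partially overlapping indexes, invented for illustration.
dataframe1 = pd.DataFrame({"count_1": [10, 20]}, index=["a", "b"])
dataframe2 = pd.DataFrame({"count_2": [30, 40]}, index=["b", "c"])

outer = join_df_sketch([dataframe1, dataframe2])           # rows a, b, c
inner = join_df_sketch([dataframe1, dataframe2], "inner")  # row b only
```

With an outer join, rows missing from one input are kept and filled with NaN. With an inner join, only index values present in every input survive.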
-
chaininglib.utils.dfops.property_freq(df, column_name)
Count values for a certain property in a results DataFrame, and sort them by frequency.
Parameters: - df – DataFrame with results, one row per found token
- column_name – Column name (property) to count
Returns: a DataFrame of the values found for this property, sorted by frequency. Column ‘token count’ contains the number of tokens, column ‘perc’ gives the percentage.
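A frequency table of this shape can be sketched with `Series.value_counts`. The function name `property_freq_sketch` and the toy results are hypothetical, and expressing ‘perc’ on a 0 to 100 scale is an assumption. The column names ‘token count’ and ‘perc’ are taken from the description above.

```python
import pandas as pd


def property_freq_sketch(df, column_name):
    """Hypothetical stand-in: count the values of one column, most frequent first."""
    # value_counts() already sorts in descending order of frequency.
    freq = df[column_name].value_counts().to_frame(name="token count")
    # Assumption: percentage on a 0-100 scale.
    freq["perc"] = freq["token count"] / freq["token count"].sum() * 100
    return freq


# Toy results with one row per found token, invented for illustration.
df_results = pd.DataFrame(
    {
        "word": ["huis", "boom", "loopt", "fiets"],
        "pos": ["NOUN", "NOUN", "VERB", "NOUN"],
    }
)

pos_freq = property_freq_sketch(df_results, "pos")
print(pos_freq)
```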