Temporary Variables
When dealing with real-world data, the level of complexity is often high. It would be better if there were some temporary variable inside the pandas data frame itself where we could store intermediate results. This would help to debug results easily, and at the same time the value would not need to be saved when writing the data frame to hard disk.
Advantages: Complex data logic can be split into multiple temp variables. Easy to debug.
Comment From: TomAugspurger
Could you give an example? You're already allowed to add arbitrary attributes to a DataFrame, so you could add a dict and put stuff in there.
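A minimal sketch of the attribute approach mentioned above. The attribute name `scratch` and the sample data are made up for illustration. The dict is wrapped in a `types.SimpleNamespace` because pandas may warn when a list-like value is assigned to a new attribute name. Note that a plain attribute like this is not carried over to objects returned by later operations.

```python
import types

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Attach a scratch area as an ordinary Python attribute, not as a column.
# `scratch` is a hypothetical name chosen for this example.
df.scratch = types.SimpleNamespace(notes={})
df.scratch.notes["row_sum"] = df["a"].sum()

# The attribute is not part of the tabular data.
print(list(df.columns))

# Caveat: a new object produced by an operation does not carry the attribute.
filtered = df[df["a"] > 1]
print(hasattr(filtered, "scratch"))
```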
Comment From: lijose
This is a sample scenario that is common in SCD (Slowly Changing Dimension) processing. There are workarounds for it, but it would be better if there were some temporary variable.
Scenario:
1) Select a list of columns from the source data frame
natural_keys = [col1, col2, ...]
2) Apply an MD5 function after merging all columns as a string, and add the result as a temp column on the source data frame (source_df.cdc)
3) Select the same set of columns from the target data frame
4) Apply an MD5 function after merging all columns as a string, and add the result as a temp column on the target data frame (target_df.cdc)
5) Compare the cdc column between the source and target data frames and set a temp column to specify a flag, like:
values matching : 'NC' (No change)
value not matching : 'U' (Update)
value matching : 'NA' (Ignore)
target.cdc is null : 'I' (Insert)
In the above scenario, steps 2, 4 and 5 need a temp variable, and there is no point in saving its value to hard disk.
I hope that clears everything up.
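The steps above could be sketched with ordinary columns as follows. This is only an illustration under several assumptions: the sample data, the `id` column used to align source and target rows, the `row_md5` helper, and the `|` separator are all hypothetical. The 'NA' flag is left out because the scenario does not say how it differs from 'NC'.

```python
import hashlib

import numpy as np
import pandas as pd


def row_md5(df, cols):
    """Return one MD5 hex digest per row over `cols`, joined as strings."""
    joined = df[cols].astype(str).apply(lambda row: "|".join(row), axis=1)
    return joined.map(lambda s: hashlib.md5(s.encode("utf-8")).hexdigest())


# Hypothetical sample data; `id` is an assumed key for aligning rows.
source_df = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["ann", "bob", "cy"], "city": ["Oslo", "Rome", "Lima"]}
)
target_df = pd.DataFrame(
    {"id": [1, 2], "name": ["ann", "bob"], "city": ["Oslo", "Paris"]}
)

# Step 1 and 3: the columns to hash on both sides.
natural_keys = ["name", "city"]

# Step 2 and 4: temporary hash columns.
source_df["cdc"] = row_md5(source_df, natural_keys)
target_df["cdc"] = row_md5(target_df, natural_keys)

# Step 5: align rows and derive the flag from the two hash columns.
merged = source_df.merge(
    target_df[["id", "cdc"]], on="id", how="left", suffixes=("", "_target")
)
merged["flag"] = np.select(
    [merged["cdc_target"].isna(), merged["cdc"] != merged["cdc_target"]],
    ["I", "U"],
    default="NC",
)

# The hash columns were only intermediate, so drop them from the result.
result = merged.drop(columns=["cdc", "cdc_target"])
print(result)
```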
Comment From: TomAugspurger
Personally, this sounds like a complication to the pandas data model, with little gain. We’d then have to define how these are handled in operations, expose them in the UI, ... It doesn’t seem worthwhile when the alternative is to drop them before writing.
Curious to hear what others think.
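As a rough illustration of the "drop before writing" alternative, the sketch below marks temporary columns with a naming prefix and removes them just before output. The `_tmp_` prefix, the column names, and the in-memory buffer standing in for a file are assumptions made for this example, not a pandas convention.

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})

# Hypothetical convention: temporary columns share a "_tmp_" prefix.
df["_tmp_doubled"] = df["amount"] * 2
df["total"] = df["_tmp_doubled"] + 1

# Collect the temporary columns and leave them out of the written data.
temp_cols = [c for c in df.columns if c.startswith("_tmp_")]

buffer = io.StringIO()  # stands in for a file on disk
df.drop(columns=temp_cols).to_csv(buffer, index=False)
print(buffer.getvalue())
```

The temporary column stays available on `df` for debugging, since `drop` returns a new frame and leaves the original untouched.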
Comment From: jreback
All of the above operations are easily accomplished by writing idiomatic code; no special facilities are needed. If you have a specific example, please show it.