The following is the proposed preliminary workflow for the data transformation project.
Let's store the code lists in a directory and read them all programmatically, saving our fingers and sparing the mouse and trackpad. First, let's create a list of the code list files.
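The code below assumes the following packages are attached: readxl provides excel_sheets() and read_excel(), dplyr provides filter() and pull(), and DT provides datatable() used at the end (lubridate is called with the :: prefix, so it only needs to be installed).
library(readxl) # excel_sheets(), read_excel()
library(dplyr)  # filter(), pull()
library(DT)     # datatable()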
filenames <- list.files("./contacts/CodeList", pattern = "\\.xlsx$", full.names = TRUE) # We could avoid creating a separate directory for the code lists, but organizing them would then be harder. This is worth revisiting if we want to transform all the data in one go, i.e. not by segment (contacts, accounts, etc.).
# File paths
print(filenames)
## [1] "./contacts/CodeList/CodeList_Contact_International_Version.xlsx"
## [2] "./contacts/CodeList/CodeList_Contact_Is_Contact_Person_For.xlsx"
## [3] "./contacts/CodeList/CodeList_Contact_Personal_Addresses.xlsx"
## [4] "./contacts/CodeList/CodeList_Contact.xlsx"
Please ensure that there are no hidden files in the directory.
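As an extra safeguard, files that look hidden or temporary can also be dropped programmatically; a minimal sketch (the "~$" prefix is the convention Excel uses for its temporary lock files):
# Drop Excel lock files ("~$...") and any dot-files from the list before reading
filenames <- filenames[!grepl("^(~\\$|\\.)", basename(filenames))]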
Now, let's attempt to read them.
sheet_names <- lapply(filenames, excel_sheets) # Creates a list of the sheet names per file
for(i in seq_along(filenames)){
  codelist_files <- lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]]) # Reads the sheets of the Excel file
  names(codelist_files) <- sheet_names[[i]] # Renames them according to the sheet names extracted above
  # for(j in seq_along(sheet_names[[i]])){
  #   assign(paste0(substr(filenames[[i]],30,nchar(filenames[[i]])-5),"_",sheet_names[[i]][j]), read_excel(path=filenames[[i]], sheet = sheet_names[[i]][j]))
  # }
}
# Names of the code list sheets imported
names(codelist_files)
## [1] "Academic_Title" "Additional_Academic_Title"
## [3] "Best_Reached_By" "CountryRegion"
## [5] "State" "Contact_Permission"
## [7] "Department" "Function"
## [9] "Gender" "Language"
## [11] "Marital_Status" "Prefix"
## [13] "Perception_Of_Company" "Profession"
## [15] "Status" "Title"
## [17] "VIP_Contact"
Now we shall extract the templates. There are two templates for each file: one is the SAP template, i.e. the file that needs to be uploaded to SAP, and the other is the legacy file that needs to be converted into the SAP template format.
We shall start with the legacy format. Since we do not have the real data, we have created a dummy file; at the moment it covers only the Contact data. Some intentional errors have been introduced in the file.
Let us now extract the data. Below we read a single file that holds all Contact-related data from the legacy system.
oldfilepath <- "./contacts/olddummy.xlsx"
old.data <- lapply(excel_sheets(oldfilepath), read_excel, path = oldfilepath)
names(old.data) <- excel_sheets(oldfilepath)
# Names of the sheets imported
names(old.data)
## [1] "Contact_o" "Contact_International_Version_o"
## [3] "Contact_Is_Contact_Person_For_o" "Contact_Personal_Addresses_o"
## [5] "Contact_Notes_o"
We shall use Contact_o, i.e. the old contact table, and transform it into the required SAP upload format after checking for possible errors.
The process of transforming all the data from one segment (e.g. Contacts or Accounts) in a single run can be implemented only if error-free data in the legacy system is guaranteed. Since that is not possible, we need to do it per table (per sheet of the Excel file).
Now we shall create the SAP template. Although the template file has multiple sheets, only the last sheet, i.e. Field_Definitions, holds enough information for us to create the template.
saptemplate<-read_excel("./contacts/Contact.xlsx", sheet = "Field_Definitions")
# First few rows of the imported data
head(saptemplate)
## # A tibble: 6 × 8
## `Sheet Name` Header `Property Name` `UI Text` `Data Type` `Max Length`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Contact External_… ExternalKey External K… String 100
## 2 Contact Contact_ID ContactID Contact ID String 10
## 3 Contact Status StatusCode Status String 2
## 4 Contact Title TitleCode Title String 4
## 5 Contact Academic_… AcademicTitleCode Academic T… String 4
## 6 Contact Additiona… AdditionalAcadem… Additional… String 4
## # … with 2 more variables: CodeList File Path <chr>, Mandatory <chr>
Please note that the format of the tables (sheets) has been slightly changed. Earlier, the corresponding sheet name was mentioned in a row before the actual table; now, every row mentions the corresponding sheet name. This was done manually for convenience of data extraction.
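The same change could also be made in code instead of by hand; a minimal sketch using tidyr::fill(), assuming the original layout simply left the Sheet Name cell empty on the rows below the first row of each block:
# Forward-fill the Sheet Name column so every row carries its sheet name
library(tidyr)
saptemplate <- fill(saptemplate, `Sheet Name`, .direction = "down")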
Although we will be using only one table at the moment, empty templates for all the sheets are created below.
snames <- unique(saptemplate$`Sheet Name`)
# Creates an empty data frame for each sheet in snames
for(i in seq_along(snames)){
  headers <- saptemplate[saptemplate$`Sheet Name` == snames[i], ]$Header # Defines the column names
  df <- read.table("", col.names = headers) # Creates an empty data frame using the column names
  assign(snames[i], df) # Assigns the value of df to a data frame named after snames[i]
}
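If cluttering the global environment with assign() becomes a concern, the same empty templates could instead be collected in a named list; a minimal sketch (the templates object is hypothetical and is not used later in this document):
# Build the empty templates as a named list rather than as separate objects
templates <- lapply(snames, function(s) {
  hdrs <- saptemplate[saptemplate$`Sheet Name` == s, ]$Header
  setNames(data.frame(matrix(nrow = 0, ncol = length(hdrs))), hdrs)
})
names(templates) <- snames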
Steps to check the legacy data for errors and transform it into an SAP-compatible format.
# Column names of the Contact table
colnames(Contact)
## [1] "External_Key" "Contact_ID"
## [3] "Status" "Title"
## [5] "Academic_Title" "Additional_Academic_Title"
## [7] "Prefix" "First_Name"
## [9] "Last_Name" "Additional_Last_Name"
## [11] "Initials" "Middle_Name"
## [13] "Gender" "Marital_Status"
## [15] "Language" "Nick_Name"
## [17] "Date_of_Birth" "Birth_Name"
## [19] "Contact_Permission" "Profession"
## [21] "Perception_Of_Company" "Account_External_Key"
## [23] "Account_ID" "Building"
## [25] "Floor" "Room"
## [27] "Job_Title" "Function"
## [29] "Department" "Department_From_Business_Card"
## [31] "VIP_Contact" "Phone"
## [33] "Mobile" "Fax"
## [35] "EMail" "EMail_Invalid"
## [37] "Best_Reached_By" "CountryRegion"
## [39] "State_Text_Updatable" "House_Number"
## [41] "Street" "City"
## [43] "Postal_Code" "State"
## [45] "Contact_Owner_External_Key" "Contact_Owner_ID"
## [47] "Former_CRM_reference"
old.copy <- old.data$Contact_o # Selecting only one table as a sample
old.copy
## # A tibble: 33 × 47
## ID CID STATUS TTL AcaTTL AddtnalTTL pre name surname add_surname
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 98320 F2371 Active Mr. <NA> B.A. von <NA> qefb <NA>
## 2 98321 F2372 Inactive Ms. <NA> Prof. Dr. von … <NA> <NA> wvjnweg
## 3 98322 F2373 not known Miss <NA> Dr. van Jojq… uqheq asdvjn
## 4 98323 F2374 Active Mast… B.A. <NA> van … ajnv… <NA> <NA>
## 5 98324 F2375 Inactive Dr. Prof.… <NA> da jenwv kuhanb… ajvn
## 6 98325 F2376 not known <NA> Dr. <NA> de <NA> <NA> <NA>
## 7 98326 F2377 Active Mr. <NA> <NA> de la <NA> <NA> niebjnwe
## 8 98327 F2378 Inactive Mr. <NA> <NA> dos wjvn… wjnweg <NA>
## 9 98328 F2379 not known Mr. <NA> B.A. du <NA> <NA> vneiwg
## 10 98329 F2380 <NA> <NA> MBA Prof. Dr. el <NA> ejavneq jsdnw
## # … with 23 more rows, and 37 more variables: initi <chr>, mid_name <chr>,
## # gender <chr>, mar_sta <chr>, lang <chr>, nick_name <lgl>, dob <lgl>,
## # birth_name <lgl>, Contact_Permission <chr>, Profession <chr>,
## # Perception_Of_Company <chr>, Account_External_Key <chr>, Account_ID <chr>,
## # Building <dbl>, Floor <lgl>, Room <lgl>, Job_Title <chr>, Function <chr>,
## # Department <chr>, Department_From_Business_Card <chr>, VIP_Contact <chr>,
## # Phone <lgl>, Mobile <lgl>, Fax <lgl>, EMail <lgl>, EMail_Invalid <lgl>, …
To do this in a safe way there are two options. Manual typing is error prone and hence should be avoided, at least in this code. For the time being, we have created a separate mapping file, although the other option is easier to maintain and would be the recommended one.
mapped <- read.csv("./contacts/contact_map.csv", sep = ";")
x <- NULL
for(i in 1:nrow(mapped)){
  x[i] <- mapped[mapped$oldkey == colnames(old.copy)[i], ]$Header # Look up the new header for the i-th old column name
}
colnames(old.copy) <- x # Changing column names
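The same renaming can be done without a loop; a minimal sketch using match(), shown only as an alternative to the loop above, not as an additional step (it assumes every old column name appears exactly once in the oldkey column of the mapping file):
# Look up all old column names at once and assign the new headers in one step
colnames(old.copy) <- mapped$Header[match(colnames(old.copy), mapped$oldkey)]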
Essential rows
saptemplate[saptemplate$`Sheet Name` == "Contact", ] |>
  filter(Mandatory == "Yes") |>
  pull(Header) -> essential.rows # List of mandatory columns
Check if some rows have missing items for mandatory columns
essen.rows.table <- read.table("", col.names = c("Item", "Missing"))
for(i in seq_along(essential.rows)){
  essen.rows.table[i, 2] <- sum(is.na(old.copy[, essential.rows[i]]))
  essen.rows.table[i, 1] <- essential.rows[i]
} # Creates the table below
essen.rows.table
## Item Missing
## 1 External_Key 1
## 2 Last_Name 15
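The same counts can also be obtained without a loop; a minimal sketch using colSums():
# Count the missing values per mandatory column in a single vectorized call
data.frame(Item = essential.rows, Missing = unname(colSums(is.na(old.copy[, essential.rows]))))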
Remove the rows with missing mandatory values
for(i in seq_along(essential.rows)){
  old.copy <- old.copy[!is.na(old.copy[, essential.rows[i]]), ]
} # Removes the rows with missing mandatory values
Check if code-listed column data are from the code list
codelistcols <- saptemplate[saptemplate$`Sheet Name` == "Contact", ] |>
  filter(!is.na(`CodeList File Path`)) |> pull(Header) # List of columns that have a code list
codelisted.rows.table <- read.table("", col.names = c("Item", "Missing", "Not_from_code"))
for(i in seq_along(codelistcols)){
  codelisted.rows.table[i, 3] <- sum(!pull(old.copy[, codelistcols[i]], 1) %in% c(pull(codelist_files[[codelistcols[i]]], Description), NA)) # NA added to the allowed set, else empty cells also get counted as mismatches
  codelisted.rows.table[i, 2] <- sum(is.na(old.copy[, codelistcols[i]]))
  codelisted.rows.table[i, 1] <- codelistcols[i]
} # Creates the table below
codelisted.rows.table
## Item Missing Not_from_code
## 1 Status 3 10
## 2 Title 11 2
## 3 Academic_Title 9 2
## 4 Additional_Academic_Title 8 0
## 5 Prefix 3 1
## 6 Gender 11 6
## 7 Marital_Status 13 4
## 8 Language 8 5
## 9 Contact_Permission 6 5
## 10 Profession 13 4
## 11 Perception_Of_Company 12 1
## 12 Function 1 1
## 13 Department 4 0
## 14 VIP_Contact 16 0
## 15 Best_Reached_By 11 0
## 16 CountryRegion 8 9
## 17 State 8 0
If a value does not match the code list, we empty it.
for(i in seq_along(codelistcols)){
  old.copy[!pull(old.copy[, codelistcols[i]], 1) %in% c(pull(codelist_files[[codelistcols[i]]], Description), NA), codelistcols[i]] <- NA
} # Removes the value in case of a mismatch

for(i in seq_along(codelistcols)){
  old.copy[, codelistcols[i]] <-
    pull(codelist_files[[codelistcols[i]]], 2)[match(pull(old.copy, codelistcols[i]), pull(codelist_files[[codelistcols[i]]], Description))]
} # Matches each column with the corresponding code list and returns the code value
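To make the lookup explicit, here is a toy illustration with made-up values (the real code lists are read from the Excel files above): match() returns the position of each description in the code list, and that position is then used to pick the corresponding code.
# Hypothetical two-column code list with a Code and a Description column
toy <- data.frame(Code = c("1", "2"), Description = c("Active", "Inactive"))
toy$Code[match(c("Inactive", "Active", NA), toy$Description)] # returns "2" "1" NA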
Fix column types
dtype <- saptemplate[saptemplate$`Sheet Name` == "Contact", ]$`Data Type` # List of data types; not exhaustive at the moment
for(i in 1:ncol(old.copy)){ # Assumes the columns of old.copy are in the same order as the rows of the Contact template
  if(dtype[i] == "String"){
    old.copy[, i] <- as.character(pull(old.copy, i))
  }
  if(dtype[i] == "Boolean"){
    old.copy[, i] <- as.logical(pull(old.copy, i))
  }
  if(dtype[i] == "DateTime"){
    old.copy[, i] <- lubridate::ymd_hms(pull(old.copy, i))
  }
  if(dtype[i] == "Time"){
    old.copy[, i] <- lubridate::hms(pull(old.copy, i))
  } # This list will grow and also change based on the input date and time formats
}

max.length <- as.numeric(saptemplate[saptemplate$`Sheet Name` == "Contact", ]$`Max Length`) # List of the maximum lengths mentioned; converted to numeric because the template stores them as text
colclasses <- lapply(old.copy, class) # Getting the column classes
for(i in 1:ncol(old.copy)){
  if(colclasses[[i]] == "character"){
    old.copy[, i] <- ifelse(nchar(pull(old.copy, i)) > max.length[i], substring(pull(old.copy, i), 1, max.length[i]), pull(old.copy, i))
  } # If a string is longer than the mentioned maximum length, trim it to that length
}

write.csv(old.copy, "Contact.csv", row.names = FALSE) # Saving the CSV file

View the exported file
contacts.sap<-read.csv("Contact.csv")
datatable(contacts.sap, options = list(scrollX = TRUE))
Viewing is for a sample only. Larger files cannot be viewed in HTML; they require server-side processing.
This document contains several explanations, and the workflow has been divided into steps. These steps may increase or decrease (by combining several steps), depending on the data quality and structure. For example, if the input datetime format differs between tables, each table may require manual transformation, increasing the number of steps and the complexity. If the data quality is good, it may even be possible to transform all the tables in a segment (contacts, accounts, etc.) in a single run. This will become clear only after obtaining the real input data. The code can be further adjusted for faster processing.
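As a small illustration of the datetime point above, lubridate can try several candidate formats in one pass; the formats and sample values below are assumptions, not taken from actual data:
# Each value is parsed with the first candidate format that fits
lubridate::parse_date_time(c("2021-03-01 10:30:00", "01.03.2021"), orders = c("ymd HMS", "dmy")) # both parse to POSIXct values on 2021-03-01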