This commit is contained in:
2022-01-14 19:16:43 +01:00
parent 2d425a9129
commit 4e5054664a
322 changed files with 125109 additions and 0 deletions

Binary file not shown.


365
Accounts.Rmd Normal file

@@ -0,0 +1,365 @@
---
title: "Accounts"
author: "Scary Scarecrow"
date: "1/10/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(dplyr)
library(lubridate)
library(DT)
library(tidyr)
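# Helper: creates one empty data frame per template sheet, named after the sheet
mutlstxlrdr<-function(){
for( i in seq_along(snames)){
colnames<-unique(saptemplate[saptemplate$`Sheet Name`==snames[i],]$Header)
df<-read.table(text = "", col.names = colnames)
assign(snames[i], df, envir = .GlobalEnv)
}
}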
```
## Data transformation workflow
The following is the proposed preliminary workflow for the data transformation project.
>All files for a segment (contacts, accounts, etc.) should be inside the relevant folder. Each segment folder should contain one folder for all codelist files. All legacy data (one file per country, with each file named after its country) should be inside the raw-data folder. A further file holding the field definitions, including the name of the matching column in the legacy file, should also be there.
>*Make sure that there are no hidden files inside the directory.*
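
The layout described above could be checked before knitting. The following is a minimal sketch, not part of the original workflow: the helper name `check_segment_layout` is hypothetical, and the folder and file names (`CodeList`, `raw-data`, `template.xlsx`) are taken from the paths used later in this document. It also assumes that Excel lock files (names starting with `~$`) are the kind of hidden file to watch for.

```r
# Hypothetical helper: verifies the folder layout for one segment and
# lists hidden or temporary files that could be picked up by mistake.
check_segment_layout <- function(segment_dir) {
  required_dirs <- file.path(segment_dir, c("CodeList", "raw-data"))
  template_file <- file.path(segment_dir, "template.xlsx")
  all_files <- list.files(segment_dir, recursive = TRUE, all.files = TRUE)
  # Hidden files start with "."; Excel lock files start with "~$"
  suspicious <- all_files[grepl("^(\\.|~\\$)", basename(all_files))]
  list(
    missing_dirs = required_dirs[!dir.exists(required_dirs)],
    template_found = file.exists(template_file),
    suspicious_files = suspicious
  )
}

# Usage on a throwaway directory (no real project data involved)
demo_dir <- file.path(tempdir(), "segment-demo")
dir.create(file.path(demo_dir, "CodeList"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_dir, "raw-data"), showWarnings = FALSE)
file.create(file.path(demo_dir, "raw-data", "~$DE.xlsx"))
layout <- check_segment_layout(demo_dir)
print(layout)
```

In the real project the call would presumably be `check_segment_layout("./accounts")`; any entry in `suspicious_files` should be removed before the files are read.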
### Code Lists
```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE}
filenames <- list.files("./accounts/CodeList", pattern="\\.xlsx$", full.names = TRUE) # list.files() expects a regular expression, not a glob. We could avoid creating a separate directory for the code lists, but organizing the files may then be difficult. This can be explored further if we want to transform all the data in one go, i.e. not by function (contacts, accounts etc.).
# File paths
print(filenames)
```
Manually check that the list above includes all the codelist files.
If it is correct, read the files.
```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE}
sheet_names<-lapply(filenames, excel_sheets) # Creates a list of the sheet names
codelist_files<-NULL
for(i in seq_along(filenames)){
a<-lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]], col_types = "text") # Reads the sheets of the excel files
names(a)<-c(sheet_names[[i]]) # Renames them according to the sheet names extracted above
codelist_files<-c(codelist_files,a)
}
# Names of the files imported
names(codelist_files)
#codelist_files<-unique(codelist_files)
codelist_files$Customer_type_I
```
### Templates
Let us now extract the data. Below we read only one file, which holds all the `Accounts` data from the legacy system.
```{r readlegacyfilepath, echo=TRUE, message=FALSE, warning=FALSE}
oldfilepath<-list.files("./accounts/raw-data", pattern="\\.xlsx?$", full.names = TRUE) # Change the path, check pattern (a regular expression, not a glob)
print(oldfilepath)
```
Manually check that the list matches the actual files.
```{r readlegacyfiles, echo=TRUE}
old_files<-NULL
#read_excel(path = oldfilepath[[i]], sheet = 1)
for(i in seq_along(oldfilepath)){
old_files[[i]]<-read_excel(path = oldfilepath[[i]], sheet = 1)
}
old_files
names(old_files)<-gsub("./accounts/raw-data/","",oldfilepath) # Change path
```
*Some errors were noticed in the legacy file: columns with similar or identical names exist.*
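
Repeated or near-identical column names like these could be listed before any mapping is attempted. The following is a minimal sketch on a made-up header; the column names are hypothetical. For a real file, the untouched header could presumably be obtained with `names(read_excel(path, n_max = 0, .name_repair = "minimal"))`, which is an assumption about how one would wire this in, not something the original code does.

```r
# Hypothetical legacy header with a repeated and a near-identical column name
legacy_cols <- c("Name", "Department", "Street", "Department", "department ")

# Exact duplicates
exact_dups <- unique(legacy_cols[duplicated(legacy_cols)])

# Near duplicates: identical after trimming whitespace and lower-casing
normalized <- tolower(trimws(legacy_cols))
is_near_dup <- duplicated(normalized) | duplicated(normalized, fromLast = TRUE)
near_dups <- unique(legacy_cols[is_near_dup])

print(exact_dups)
print(near_dups)
```

Any name reported here needs a decision in the field definitions about which of the legacy columns is the intended source.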
```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE}
saptemplate<-read_excel("./accounts/template.xlsx", sheet = "Field_Definitions")
# First few rows of the imported data
head(saptemplate)
```
*Please note that the format of the tables (sheets) has been changed slightly. Earlier, the corresponding sheet name was given in a row before the actual table. Now every row states its sheet name. This was done manually to make data extraction easier.*
## The Status column is not defined
## There could be an issue in line of business
```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE}
#orilo<-"en_US.UTF-8"
#Sys.setlocale(locale="en_US.UTF-8")
strt<-Sys.time()
snames <- unique(saptemplate$`Sheet Name`)
for (h in seq_along(old_files)) {
# Copy original data
old.copy <- old_files[[h]]
print(paste0(names(old_files[h])," imported"))
err.summ<-data.frame(Country=NULL, Name=NULL, Expected=NULL, Actual=NULL) #Error Cal
# Creates data frame for each sheet in snames
for (i in seq_along(snames)) {
print(paste0("Processing ..",snames[i]))
# Select the column names from the field description sheet
print("Creating template")
sel.template.desc <-
saptemplate[saptemplate$`Sheet Name` == snames[i], ]
print("Creating column names")
sel.template.desc.colnames <- sel.template.desc$Header
# Create a list by adding values from corresponding legacy data
temp <- NULL
print("adding values to template ")
for (j in seq_along(sel.template.desc.colnames)) {
temp[j] <-ifelse(sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]),
NA,as.vector(old.copy[, sel.template.desc$oldkey[j]])
)
}
# Rename the columns according to field description
print("renaming template ")
names(temp) <- sel.template.desc.colnames
# Create data frame from the list
df <- as.data.frame(temp)
print("Converted to data frame")
# Error summary file
Expected<-nrow(df)
#Select essential rows
print("Identifying essential rows")
sel.template.desc |>
filter(Mandatory == "Yes") |>
pull(Header) -> essential.columns
error.mandatory <- NULL
error.df<-data.frame(Country=NULL, Name=NULL, Rows=NULL, Expected=NULL)
# Operate on essential columns including creation of error file
for (k in seq_along(essential.columns)) { # In case there are any default values (of mandatory) they need to be added here
if(essential.columns[k]=="International_Version"){
print("Found International Version. Adding 0.")
df$International_Version<-"0"
}
print("Creating and writing data with missing mandatory values")
assign(
paste0(
"error_mandatory_",
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_",
essential.columns[k]
),
df[is.na(df[, essential.columns[k]]), ]
)
# TO be saved in error files
if(nrow(df[is.na(df[, essential.columns[k]]), ])>0){
write.csv(
df[is.na(df[, essential.columns[k]]), ],
paste0(
"./accounts/errors/mandatory/", #Change path
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_",
essential.columns[k],
"_error_mandatory.csv"
), row.names = F, na=""
)
}
# Error summary file
Country<-substr(names(old_files[h]), 2, 3)
Name<-snames[i]
err.type<-paste0("Missing ",essential.columns[k])
err.count<-nrow(df[is.na(df[, essential.columns[k]]), ])
print("Removing rows with empty essential columns")
df <- df[!is.na(df[, essential.columns[k]]), ]
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
}
print("Identifying columns associated with codelists")
# List of columns that have a codelist
codelistcols <- sel.template.desc |>
filter(!is.na(`CodeList File Path`)) |> pull(Header)
for (k in seq_along(codelistcols)) {
print(paste0("Identifying errors ",codelistcols[k]))
def.rows <-
which(!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA))
def.n<- df[def.rows, 1]
def.rows.val <-
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]]
def <- data.frame(def.rows, def.n,def.rows.val)
if(nrow(def)>0){
assign(paste0(
"error_codematch_",
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_",
codelistcols[k]
),
def) # TO be saved
write.csv(
def,
paste0(
"./accounts/errors/codelist/", #Change path
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_",
codelistcols[k],
"_error_codematch_.csv"
), row.names = F, na=""
)
}
err.type<-paste0("Codelist Mismatch ", codelistcols[k]) #Error cal
err.count<-nrow(def) #Error cal
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
print(paste0("Removing errors ",codelistcols[k]))
# Removes any mismatch
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] <-
NA
# Matches each column with the corresponding code list and returns the value
df[, codelistcols[k]] <-
pull(codelist_files[codelistcols[k]][[1]], 2)[match(pull(df, codelistcols[k]),
pull(codelist_files[codelistcols[k]][[1]], Description))]
}
max.length <- as.numeric(sel.template.desc$`Max Length`)
dtype <- sel.template.desc$`Data Type`
rowval <- NULL
ival <- NULL
rval <- NULL
lenght.issue.df <- NULL
# Changing the data class
for (k in 1:ncol(df)) {
if (dtype[k] == "String") {
df[, k] <- as.character(pull(df, k))
}
if (dtype[k] == "Boolean") {
df[, k] <- as.logical(pull(df, k))
}
if (dtype[k] == "DateTime") {
df[, k] <- lubridate::ymd_hms(pull(df, k))
}
if (dtype[k] == "Time") {
df[, k] <- lubridate::hms(pull(df, k))
} # This list will increase and also change based on input date and time formats
}
print("Rectifying streetname")
# Street and House Number
if (any(colnames(df) == "Street")) {
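# Separates streetname and housenumber; the result must be assigned back to df,
# and Street is kept (remove = FALSE) so that the select() below can drop it
df <- tidyr::extract(df,
"Street",
c("Streetname", "HouseNumber"),
"(\\D+)(\\d.*)",
remove = FALSE)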
df <- df |>
select(-c("Street", "House_Number")) |>
rename(Street = Streetname, House_Number = HouseNumber) |>
select(all_of(sel.template.desc.colnames))
}
# Length Rectification
colclasses <- lapply(df, class)
print("Rectifying Length")
for (k in 1:ncol(df)) {
if (is.character(df[[k]])) { # class() can have length > 1 (e.g. POSIXct), so test the type directly
print("found character column ")
rowval <- pull(df, 1)
ival <- ifelse(nchar(pull(df, k))== 0 | is.na(nchar(pull(df, k))),1,nchar(pull(df, k)))
rval <- max.length[k]
# rectifying data length
df[, k] <-
ifelse(nchar(pull(df, k)) > max.length[k],
substring(pull(df, k), 1, max.length[k]),
pull(df, k))
lenght.issue.df <-
rbind(lenght.issue.df, data.frame(rowval, ival, rval))
err.type<- paste0("Length error ", colnames(df)[k]) # Error cal
err.count<- sum(ival>rval, na.rm = T) # Error cal
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
} # Length errors are recorded only for character columns, so stale values are not counted again
}
lenght.issue.df <- dplyr::filter(lenght.issue.df,ival>rval)
if(nrow(lenght.issue.df)>0){
write.csv(lenght.issue.df,
paste0(
"./accounts/errors/length/", # Change path
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_length_error.csv"
), row.names = F, na="")
}
assign(snames[i], df)
write.csv(df,paste0("./accounts/output/", substr(names(old_files[h]), 2, 3), "_", snames[i],".csv"), row.names = F, na="") # Change path
if(nrow(error.df)>0){
write.csv(error.df, paste0("./accounts/summary/",substr(names(old_files[h]), 2, 3), "_", snames[i],"_error",".csv"), row.names = F, na="") # Error write
}
err.summ<-rbind(err.summ,data.frame(Country=Country, Name=Name, Expected=Expected, Actual=nrow(df))) #Error Cal
}
write.csv(err.summ,
paste0("./accounts/summary/" ,substr(names(old_files[h]), 2, 3), "_", snames[i],"_sumerror",".csv"), row.names = F, na="") # Error Write
}
end<-Sys.time()
end-strt
```
*The code failed because the Department column appears several times in the data, and while importing, R renamed the duplicates to Department..xx.*
*Manually verify that these are the required templates.*
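
The renaming mentioned above can be reproduced without any Excel file. The following is a minimal sketch, assuming that the reader applies the "unique" name-repair strategy (the documented default of `read_excel()`), which appends `...<position>` to repeated names. The header and the helper `find_repaired` are hypothetical.

```r
library(vctrs)

# Hypothetical legacy header in which "Department" occurs twice
legacy_header <- c("Name", "Department", "Phone", "Department")

# The "unique" repair strategy renames every repeated column by position
repaired <- vec_as_names(legacy_header, repair = "unique", quiet = TRUE)
print(repaired)

# Hypothetical helper: finds every repaired column that stems from one legacy name
find_repaired <- function(key, repaired_names) {
  base_names <- sub("\\.\\.\\.[0-9]+$", "", repaired_names)
  repaired_names[base_names == key]
}

matches <- find_repaired("Department", repaired)
print(matches)
```

A lookup of the plain name `Department` fails after such a repair, so the field definitions would have to name the repaired column explicitly, or the duplicates would have to be resolved in the legacy file.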

18
Contact.csv Normal file

@@ -0,0 +1,18 @@
"External_Key","Contact_ID","Status","Title","Academic_Title","Additional_Academic_Title","Prefix","First_Name","Last_Name","Additional_Last_Name","Initials","Middle_Name","Gender","Marital_Status","Language","Nick_Name","Date_of_Birth","Birth_Name","Contact_Permission","Profession","Perception_Of_Company","Account_External_Key","Account_ID","Building","Floor","Room","Job_Title","Function","Department","Department_From_Business_Card","VIP_Contact","Phone","Mobile","Fax","EMail","EMail_Invalid","Best_Reached_By","CountryRegion","Street","City","Postal_Code","State","Contact_Owner_External_Key","Contact_Owner_ID","Former_CRM_reference","House_Number","State_Text_Updatable"
"98320","F2371","2","0002",NA,"0004","0001",NA,"qefb",NA,"D",NA,NA,NA,NA,NA,NA,NA,"1",NA,"01","nnfknwei","njljenf","1",NA,NA,NA,"0001","0001",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"98322","F2373",NA,NA,NA,"0001","0003","Jojqfn","uqheq","asdvjn",NA,NA,NA,NA,"DE",NA,NA,NA,"3",NA,"03",NA,NA,"3",NA,NA,NA,"0003","0003",NA,NA,NA,NA,NA,NA,NA,"INT",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"98324","F2375",NA,NA,"0003",NA,"0005","jenwv","kuhanbbw","ajvn",NA,"qjebofb",NA,NA,"ES",NA,NA,NA,NA,NA,NA,NA,NA,"5",NA,NA,"wevne","0005","0005",NA,NA,NA,NA,NA,NA,NA,"TEL",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"98327","F2378",NA,"0002",NA,NA,"0008","wjvnjwnef","wjnweg",NA,"I","sjdvnw",NA,NA,"NL",NA,NA,NA,NA,NA,"02",NA,NA,"8",NA,NA,"aeb","0008","0008",NA,"C",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"BGL",NA,NA,NA,NA,NA
"98329","F2380",NA,NA,"0005","0003","0010",NA,"ejavneq","jsdnw","J","wienw",NA,NA,"ZH",NA,NA,NA,NA,NA,"01",NA,"wejgnkjlqe","10",NA,NA,NA,"0010","0010",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"FRA",NA,NA,NA,NA,NA
"98332","F2383",NA,NA,NA,NA,"0013",NA,"jviwef",NA,"NU","wjbwv",NA,NA,NA,NA,NA,NA,"2",NA,NA,"wejnfwjg","weignwgw","13",NA,NA,"ertbgewb","0013","0013",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"GHO",NA,NA,NA,NA,NA
"98333","F2384",NA,NA,NA,NA,"0014","qwejfnv","jnbwon","wsebhjuw","IE","wjgniwg",NA,NA,NA,NA,NA,NA,"3",NA,NA,NA,NA,"14",NA,NA,NA,"0014","0014",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"HEL",NA,NA,NA,NA,NA
"98336","F2387",NA,NA,"0001","0003","0017","qejfjv","wjbnjnw","wejbwe","J","wehbwef",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"weojgqwegn","17",NA,NA,NA,"0017","0017",NA,NA,NA,NA,NA,NA,NA,"FAX",NA,NA,NA,NA,"KAB",NA,NA,NA,NA,NA
"98337","F2388",NA,NA,NA,"0001","0018",NA,"svnjwne",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"18",NA,NA,"wfbefb","0018","0018",NA,NA,NA,NA,NA,NA,NA,"INT",NA,NA,NA,NA,"KAN",NA,NA,NA,NA,NA
"98338","F2389","2",NA,NA,NA,"0019","qsvjbj","ijwegno","hwegbjwe","J","wejbiwq",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"19",NA,NA,NA,"0019","0019",NA,NA,NA,NA,NA,NA,NA,"LET",NA,NA,NA,NA,"KAP",NA,NA,NA,NA,NA
"98340","F2391",NA,NA,"0005",NA,"0021","kavjbjleq","dnbw","wejbwe",NA,"wejnw",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"21",NA,NA,NA,"0021","0021",NA,NA,NA,NA,NA,NA,NA,"VIS",NA,NA,NA,NA,"KHO",NA,NA,NA,NA,NA
"98341","F2392","2",NA,"0006",NA,NA,NA,"sjenw","wejfbiwef","JJ",NA,NA,NA,NA,NA,NA,NA,"1",NA,NA,NA,NA,"22",NA,NA,NA,"0022","0022",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"KNR",NA,NA,NA,NA,NA
"98343","F2394",NA,"0001",NA,"0003","0024","asjvnef","sefnjwe",NA,"JEI","wejnet",NA,NA,NA,NA,NA,NA,"3",NA,NA,NA,"ergnerg","24",NA,NA,NA,"0024","0024",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"98344","F2395","2",NA,NA,"0001","0025",NA,"wejbwee","wejhbwef",NA,"wjgb",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"25",NA,NA,NA,"0025",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"98346","F2397",NA,NA,"0003",NA,NA,NA,"jevwbi","wejbubvw",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"27",NA,NA,NA,"0027",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"98349","F2400",NA,"0001",NA,"0003",NA,NA,"asvbwe","wefjnbwe",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"98350","F2401",NA,NA,NA,"0001",NA,NA,"jasbv",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

402
Contacts.Rmd Normal file

@@ -0,0 +1,402 @@
---
title: "Contacts"
author: "Scary Scarecrow"
date: "12/27/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(dplyr)
library(lubridate)
library(DT)
library(tidyr)
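# Helper: creates one empty data frame per template sheet, named after the sheet
mutlstxlrdr<-function(){
for( i in seq_along(snames)){
colnames<-unique(saptemplate[saptemplate$`Sheet Name`==snames[i],]$Header)
df<-read.table(text = "", col.names = colnames)
assign(snames[i], df, envir = .GlobalEnv)
}
}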
```
## Data transformation workflow
The following is the proposed preliminary workflow for the data transformation project.
>All files for a segment (contacts, accounts, etc.) should be inside the relevant folder. Each segment folder should contain one folder for all codelist files. All legacy data (one file per country, with each file named after its country) should be inside the raw-data folder. A further file holding the field definitions, including the name of the matching column in the legacy file, should also be there.
>*Make sure that there are no hidden files inside the directory.*
### Code Lists
```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE}
filenames <- list.files("./contacts/CodeList", pattern="\\.xlsx$", full.names = TRUE) # list.files() expects a regular expression, not a glob. We could avoid creating a separate directory for the code lists, but organizing the files may then be difficult. This can be explored further if we want to transform all the data in one go, i.e. not by function (contacts, accounts etc.).
# File paths
print(filenames)
```
Manually check that the list above includes all the codelist files.
If it is correct, read the files.
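
Part of this manual check could be automated once the field definitions are available. The following is a minimal sketch with made-up names: in the workflow, the first vector would presumably come from the `Header` values that have a `CodeList File Path`, and the second from `names(codelist_files)`. It relies on the codelist sheets being named after the template columns, which is how the transformation code later looks them up.

```r
# Hypothetical stand-ins for the template columns that expect a codelist
# and for the codelist sheets that were actually loaded
template_codelist_cols <- c("Academic_Title", "Gender", "Language")
loaded_codelists <- c("Academic_Title", "Language", "Marital_Status")

# Columns that expect a codelist but have no matching sheet loaded
missing_codelists <- setdiff(template_codelist_cols, loaded_codelists)
# Loaded sheets that no template column refers to
unused_codelists <- setdiff(loaded_codelists, template_codelist_cols)

if (length(missing_codelists) > 0) {
  message("No codelist loaded for: ", paste(missing_codelists, collapse = ", "))
}
if (length(unused_codelists) > 0) {
  message("Codelist not referenced by the template: ", paste(unused_codelists, collapse = ", "))
}
```

A missing codelist is the more serious case, because the lookup for that column would fail during the transformation.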
```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE}
sheet_names<-lapply(filenames, excel_sheets) # Creates a list of the sheet names
codelist_files<-NULL
for(i in seq_along(filenames)){
a<-lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]], col_types = "text") # Reads the sheets of the excel files
names(a)<-c(sheet_names[[i]]) # Renames them according to the sheet names extracted above
codelist_files<-c(codelist_files,a)
}
# Names of the files imported
names(codelist_files)
#codelist_files<-unique(codelist_files)
codelist_files$Academic_Title
```
### Templates
Let us now extract the data. Below we read only one file, which holds all the `Contacts` data from the legacy system.
```{r readlegacyfilepath, echo=TRUE, message=FALSE, warning=FALSE}
oldfilepath <- list.files("./contacts/raw-data/", pattern="\\.xlsx$", full.names = TRUE) # pattern is a regular expression, not a glob
print(oldfilepath)
```
Manually check that the list matches the actual files.
```{r readlegacyfiles, echo=TRUE}
old_files<-NULL
#read_excel(path = oldfilepath[[i]], sheet = 1)
for(i in seq_along(oldfilepath)){
old_files[[i]]<-read_excel(path = oldfilepath[[i]], sheet = 1)
}
names(old_files)<-gsub("./contacts/raw-data/","",oldfilepath)
```
*Some errors were noticed in the legacy file: columns with similar or identical names exist.*
```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE}
saptemplate<-read_excel("./contacts/template.xlsx", sheet = "Field_Definitions")
# First few rows of the imported data
head(saptemplate)
```
*Please note that the format of the tables (sheets) has been changed slightly. Earlier, the corresponding sheet name was given in a row before the actual table. Now every row states its sheet name. This was done manually to make data extraction easier.*
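
The manual reshaping described in the note could in principle be scripted. The following is a minimal sketch, assuming the earlier layout had a row carrying only the sheet name, followed by the rows of that sheet's table. The sheet names and headers here are made up for illustration, and the real field-definition sheet has more columns.

```r
library(tidyr)

# Hypothetical earlier layout: the sheet name sits on its own row before each table
field_defs <- data.frame(
  `Sheet Name` = c("Contact", NA, NA, "Contact_Notes", NA),
  Header = c(NA, "External_Key", "First_Name", NA, "Text"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Carry each sheet name down to the rows of its table
filled <- fill(field_defs, `Sheet Name`, .direction = "down")
# Drop the rows that only carried the sheet name
filled <- filled[!is.na(filled$Header), ]
print(filled)
```

After this step every row states its sheet name, which is the shape the `saptemplate$`Sheet Name`` filter in the transformation loop relies on.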
```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE}
#orilo<-"en_US.UTF-8"
#Sys.setlocale(locale="en_US.UTF-8")
strt<-Sys.time()
snames <- unique(saptemplate$`Sheet Name`)
for (h in seq_along(old_files)) {
# Copy original data
old.copy <- old_files[[h]]
print(paste0(names(old_files[h])," imported"))
err.summ<-data.frame(Country=NULL, Name=NULL, Expected=NULL, Actual=NULL) #Error Cal
# Creates data frame for each sheet in snames
for (i in seq_along(snames)) {
print(paste0("Processing ..",snames[i]))
# Select the column names from the field description sheet
print("Creating template")
sel.template.desc <-
saptemplate[saptemplate$`Sheet Name` == snames[i], ]
print("Creating column names")
sel.template.desc.colnames <- sel.template.desc$Header
# Create a list by adding values from corresponding legacy data
temp <- NULL
print("adding values to template ")
for (j in seq_along(sel.template.desc.colnames)) {
temp[j] <-ifelse(sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]),
NA,as.vector(old.copy[, sel.template.desc$oldkey[j]])
)
}
# Rename the columns according to field description
print("renaming template ")
names(temp) <- sel.template.desc.colnames
# Create data frame from the list
df <- as.data.frame(temp)
print("Converted to data frame")
# Error summary file
Expected<-nrow(df)
#Select essential rows
print("Identifying essential rows")
sel.template.desc |>
filter(Mandatory == "Yes") |>
pull(Header) -> essential.columns
error.mandatory <- NULL
error.df<-data.frame(Country=NULL, Name=NULL, Rows=NULL, Expected=NULL)
# Operate on essential columns including creation of error file
for (k in seq_along(essential.columns)) {
if(essential.columns[k]=="International_Version"){
print("Found International Version. Adding 0.")
#stop()
df$International_Version<-"0"
}
print("Creating and writing data with missing mandatory values")
assign(
paste0(
"error_mandatory_",
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_",
essential.columns[k]
),
df[is.na(df[, essential.columns[k]]), ]
)
# TO be saved in error files
if(nrow(df[is.na(df[, essential.columns[k]]), ])>0){
write.csv(
df[is.na(df[, essential.columns[k]]), ],
paste0(
"./contacts/errors/mandatory/",
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_",
essential.columns[k],
"_error_mandatory.csv"
), row.names = F, na=""
)
}
# Error summary file
Country<-substr(names(old_files[h]), 2, 3)
Name<-snames[i]
err.type<-paste0("Missing ",essential.columns[k])
err.count<-nrow(df[is.na(df[, essential.columns[k]]), ])
print("Removing rows with empty essential columns")
df <- df[!is.na(df[, essential.columns[k]]), ]
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
}
print("Identifying columns associated with codelists")
# List of columns that have a codelist
codelistcols <- sel.template.desc |>
filter(!is.na(`CodeList File Path`)) |> pull(Header)
for (k in seq_along(codelistcols)) {
if(codelistcols[k]=="International_Version"){
print("Found International Version. Adding 0.")
df$International_Version<-"0"
}
print(paste0("Identifying errors ",codelistcols[k]))
def.rows <-
which(!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA))
def.n<- df[def.rows, 1]
def.rows.val <-
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]]
def <- data.frame(def.rows, def.n,def.rows.val)
if(nrow(def)>0){
assign(paste0(
"error_codematch_",
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_",
codelistcols[k]
),
def) # TO be saved
write.csv(
def,
paste0(
"./contacts/errors/codelist/",
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_",
codelistcols[k],
"_error_codematch_.csv"
), row.names = F, na=""
)
}
err.type<-paste0("Codelist Mismatch ", codelistcols[k]) #Error cal
err.count<-nrow(def) #Error cal
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
print(paste0("Removing errors ",codelistcols[k]))
# Removes any mismatch
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] <-
NA
# Matches each column with the corresponding code list and returns the value
df[, codelistcols[k]] <-
as.character(pull(codelist_files[codelistcols[k]][[1]], 2)[match(pull(df, codelistcols[k]),
pull(codelist_files[codelistcols[k]][[1]], Description))])
}
max.length <- as.numeric(sel.template.desc$`Max Length`)
dtype <- sel.template.desc$`Data Type`
rowval <- NULL
ival <- NULL
rval <- NULL
lenght.issue.df <- NULL
# Changing the data class
for (k in 1:ncol(df)) {
if (dtype[k] == "String") {
df[, k] <- as.character(pull(df, k))
}
if (dtype[k] == "Boolean") {
df[, k] <- as.logical(pull(df, k))
}
if (dtype[k] == "DateTime") {
df[, k] <- lubridate::ymd_hms(pull(df, k))
}
if (dtype[k] == "Time") {
df[, k] <- lubridate::hms(pull(df, k))
} # This list will increase and also change based on input date and time formats
}
print("Rectifying streetname")
# Street and House Number
if (any(colnames(df) == "Street")) {
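# Separates streetname and housenumber; the result must be assigned back to df,
# and Street is kept (remove = FALSE) so that the select() below can drop it
df <- tidyr::extract(df,
"Street",
c("Streetname", "HouseNumber"),
"(\\D+)(\\d.*)",
remove = FALSE)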
df <- df |>
select(-c("Street", "House_Number")) |>
rename(Street = Streetname, House_Number = HouseNumber) |>
select(all_of(sel.template.desc.colnames))
}
# Length Rectification
colclasses <- lapply(df, class)
print("Rectifying Length")
for (k in 1:ncol(df)) {
if (is.character(df[[k]])) { # class() can have length > 1 (e.g. POSIXct), so test the type directly
print("found character column ")
rowval <- pull(df, 1)
ival <- ifelse(nchar(pull(df, k))== 0 | is.na(nchar(pull(df, k))),1,nchar(pull(df, k)))
rval <- max.length[k]
# rectifying data length
df[, k] <-
ifelse(nchar(pull(df, k)) > max.length[k],
substring(pull(df, k), 1, max.length[k]),
pull(df, k))
lenght.issue.df <-
rbind(lenght.issue.df, data.frame(rowval, ival, rval))
err.type<- paste0("Length error ", colnames(df)[k]) # Error cal
err.count<- sum(ival>rval, na.rm = T) # Error cal
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
} # Length errors are recorded only for character columns, so stale values are not counted again
}
lenght.issue.df <- dplyr::filter(lenght.issue.df,ival>rval)
if(nrow(lenght.issue.df)>0){
write.csv(lenght.issue.df,
paste0(
"./contacts/errors/length/",
substr(names(old_files[h]), 2, 3),
"_",
snames[i],
"_length_error.csv"
), row.names = F, na="")
}
assign(snames[i], df)
write.csv(df,paste0("./contacts/output/", substr(names(old_files[h]), 2, 3), "_", snames[i],".csv"), row.names = F, na="")
if(nrow(error.df)>0){
write.csv(error.df, paste0("./contacts/summary/",substr(names(old_files[h]), 2, 3), "_", snames[i],"_error",".csv"), row.names = F, na="") # Error write
}
err.summ<-rbind(err.summ,data.frame(Country=Country, Name=Name, Expected=Expected, Actual=nrow(df))) #Error Cal
}
write.csv(err.summ,
paste0("./contacts/summary/" ,substr(names(old_files[h]), 2, 3), "_", snames[i],"_sumerror",".csv"), row.names = F, na="") # Error Write
}
end<-Sys.time()
end-strt
```
*The code failed because the Department column appears several times in the data, and while importing, R renamed the duplicates to Department..xx.*
*Manually verify that these are the required templates.*


@@ -0,0 +1,13 @@
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX

Binary file not shown.


373
Projects.Rmd Normal file

@@ -0,0 +1,373 @@
---
title: "Projects"
author: "Scary Scarecrow"
date: "1/12/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(dplyr)
library(lubridate)
library(DT)
library(tidyr)
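# Helper: creates one empty data frame per template sheet, named after the sheet
mutlstxlrdr<-function(){
for( i in seq_along(snames)){
colnames<-unique(saptemplate[saptemplate$`Sheet Name`==snames[i],]$Header)
df<-read.table(text = "", col.names = colnames)
assign(snames[i], df, envir = .GlobalEnv)
}
}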
```
## Data transformation workflow
The following is the proposed preliminary workflow for the data transformation project.
>All files for a segment (contacts, accounts, etc.) should be inside the relevant folder. Each segment folder should contain one folder for all codelist files. All legacy data (one file per country, with each file named after its country) should be inside the raw-data folder. A further file holding the field definitions, including the name of the matching column in the legacy file, should also be there.
>*Make sure that there are no hidden files inside the directory.*
### Code Lists
```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE}
filenames <- list.files("./projects/CodeList", pattern="\\.xlsx$", full.names = TRUE) # list.files() expects a regular expression, not a glob. We could avoid creating a separate directory for the code lists, but organizing the files may then be difficult. This can be explored further if we want to transform all the data in one go, i.e. not by function (contacts, accounts etc.).
# File paths
print(filenames)
```
Manually check that the list above includes all the codelist files.
If it is correct, read the files.
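
Once read, each codelist serves as a lookup table from a legacy description to a target code. The following is a minimal sketch of that lookup on made-up data. The `Description` column name and the use of the second column as the code follow the transformation code in this document; the values themselves are hypothetical.

```r
# Hypothetical codelist: description in the first column, code in the second
codelist <- data.frame(
  Description = c("Mr.", "Ms.", "Dr."),
  Code = c("0001", "0002", "0003"),
  stringsAsFactors = FALSE
)

# Hypothetical legacy values, including one unknown value and one missing value
legacy_values <- c("Ms.", "Dr.", "Prof.", NA)

# Values that are neither in the codelist nor missing count as mismatches
mismatch <- !(legacy_values %in% c(codelist$Description, NA))

# match() returns NA for anything without a codelist entry
codes <- codelist$Code[match(legacy_values, codelist$Description)]

print(legacy_values[mismatch])
print(codes)
```

Reading the codelists with `col_types = "text"` matters here, because codes such as `0002` would lose their leading zeros if they were parsed as numbers.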
```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE}
sheet_names<-lapply(filenames, excel_sheets) # Creates a list of the sheet names
codelist_files<-NULL
for(i in seq_along(filenames)){
a<-lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]], col_types = "text") # Reads the sheets of the excel files
names(a)<-c(sheet_names[[i]]) # Renames them according to the sheet names extracted above
codelist_files<-c(codelist_files,a)
}
# Names of the files imported
names(codelist_files)
#codelist_files<-unique(codelist_files)
codelist_files$Academic_Title
```
### Templates
Let us now extract the data. Below we read the files in the raw-data folder, which hold all data related to `Projects` from the legacy system.
```{r readlegacyfilepath, echo=TRUE, message=FALSE, warning=FALSE}
oldfilepath<-list.files("./projects/raw-data", pattern="\\.xlsx?$", full.names = T) # Change the path if needed; the pattern is a regular expression matching .xls and .xlsx
print(oldfilepath)
```
Manually check that the list matches the actual files.
```{r readlegacyfiles, echo=TRUE}
old_files<-NULL
#read_excel(path = oldfilepath[[i]], sheet = 1)
for(i in seq_along(oldfilepath)){
old_files[[i]]<-read_excel(path = oldfilepath[[i]], sheet = 1)
}
names(old_files)<-basename(oldfilepath) # File names without the directory part
```
*Some errors were noticed in the legacy file: columns with similar or identical names exist.*
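
Repeated or near-identical column names can be listed before the transformation runs. The following base R sketch works on a made-up header vector (`legacy_headers` is hypothetical, not taken from the legacy files) and separates exact duplicates from names that differ only in case or surrounding whitespace.

```r
# Hypothetical header row of a legacy file with a repeated column
legacy_headers <- c("Name", "Department", "City", "Department", "department ")

# Exact duplicates
exact_dups <- unique(legacy_headers[duplicated(legacy_headers)])

# Near duplicates: compare after trimming whitespace and lower-casing
normalised <- tolower(trimws(legacy_headers))
near_dups <- unique(normalised[duplicated(normalised)])

# make.unique() shows one way to disambiguate repeated names
unique_headers <- make.unique(legacy_headers, sep = "_")

print(exact_dups)
print(near_dups)
print(unique_headers)
```

For a real file, the header vector could come from `names()` of the imported table; `read_excel()` has a `.name_repair` argument that controls how repeated names are handled on import.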
```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE}
saptemplate<-read_excel("./projects/template.xlsx", sheet = "Field_Definitions")
# First few rows of the imported data
head(saptemplate)
```
*Please note that the format of the tables (sheet) has been slightly changed. Earlier the corresponding sheet name was mentioned in a row before the actual table. Now, all the rows mention the corresponding sheet name. This was done manually for convenience of data extraction*
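
The manual edit described above could in principle be scripted. The sketch below assumes a simplified layout that may differ from the real template: the sheet name appears once at the start of each block and is missing (`NA`) on the following rows. All values are invented for illustration.

```r
# Hypothetical column: sheet name given once per block, NA on following rows
sheet_col <- c("Project", NA, NA, "Project_Team", NA)
header <- c("ID", "Name", "Currency", "ID", "Role")

# Index of the most recent non-missing entry for every row
# (assumes the first row is not missing)
last_seen <- cumsum(!is.na(sheet_col))
filled <- sheet_col[!is.na(sheet_col)][last_seen]

field_definitions <- data.frame(`Sheet Name` = filled, Header = header, check.names = FALSE)
print(field_definitions)
```

`tidyr::fill()` with `.direction = "down"` performs the same fill-down on a data frame column.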
```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE}
#orilo<-"en_US.UTF-8"
#Sys.setlocale(locale="en_US.UTF-8")
strt<-Sys.time()
snames <- unique(saptemplate$`Sheet Name`)
for (h in seq_along(old_files)) {
# Copy original data
old.copy <- old_files[[h]]
print(paste0(names(old_files[h])," imported"))
err.summ<-data.frame(Country=NULL, Name=NULL, Expected=NULL, Actual=NULL) #Error Cal
# Creates data frame for each sheet in snames
for (i in seq_along(snames)) {
print(paste0("Processing ..",snames[i]))
# Select the column names from the field description sheet
print("Creating template")
sel.template.desc <-
saptemplate[saptemplate$`Sheet Name` == snames[i], ]
print("Creating column names")
sel.template.desc.colnames <- sel.template.desc$Header
# Create a list by adding values from corresponding legacy data
temp <- NULL
print("adding values to template ")
for (j in seq_along(sel.template.desc.colnames)) {
temp[j] <-ifelse(sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]),
NA,as.vector(old.copy[, sel.template.desc$oldkey[j]])
)
}
# Rename the columns according to field description
print("renaming template ")
names(temp) <- sel.template.desc.colnames
# Create data frame from the list
df <- as.data.frame(temp)
print("Converted to data frame")
# Error summary file
Expected<-nrow(df)
# Select essential (mandatory) columns
print("Identifying essential columns")
sel.template.desc |>
filter(Mandatory == "Yes") |>
pull(Header) -> essential.columns
error.mandatory <- NULL
error.df<-data.frame(Country=NULL, Name=NULL, Rows=NULL, Expected=NULL)
# Operate on essential columns including creation of error file
for (k in seq_along(essential.columns)) {
if(essential.columns[k]=="Currency"){
print("Found Currency. Adding CHF.")
#stop()
df$Currency<-"CHF" # Assumed intent: default the Currency column itself (the original wrote to International_Version)
}
print("Creating and writing data with missing mandatory values")
assign(
paste0(
"error_mandatory_",
substr(names(old_files[h]), 1, 2),
"_",
snames[i],
"_",
essential.columns[k]
),
df[is.na(df[, essential.columns[k]]), ]
)
# TO be saved in error files
if(nrow(df[is.na(df[, essential.columns[k]]), ])>0){
write.csv(
df[is.na(df[, essential.columns[k]]), ],
paste0(
"./projects/errors/mandatory/",
substr(names(old_files[h]), 1, 2),
"_",
snames[i],
"_",
essential.columns[k],
"_error_mandatory.csv"
), row.names = F, na=""
)
}
# Error summary file
Country<-substr(names(old_files[h]), 1, 2)
Name<-snames[i]
err.type<-paste0("Missing ",essential.columns[k])
err.count<-nrow(df[is.na(df[, essential.columns[k]]), ])
print("Removing rows with empty essential columns")
df <- df[!is.na(df[, essential.columns[k]]), ]
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
}
print("Identifying columns associated with codelists")
# List of columns that have a codelist
codelistcols <- sel.template.desc |>
filter(!is.na(`CodeList File Path`)) |> pull(Header)
for (k in seq_along(codelistcols)) {
if(codelistcols[k]=="Currency"){
print("Found Currency. Adding CHF.")
df$Currency<-"CHF" # Assumed intent: default the Currency column itself (the original wrote to International_Version)
}
print(paste0("Identifying errors ",codelistcols[k]))
def.rows <-
which(!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA))
def.n<- df[def.rows, 1]
def.rows.val <-
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]]
def <- data.frame(def.rows, def.n,def.rows.val)
if(nrow(def)>0){ # Compare the row count, not the data frame, with 0
assign(paste0(
"error_codematch_",
substr(names(old_files[h]), 1, 2), # Country prefix of the file being processed
"_",
snames[i],
"_",
codelistcols[k]
),
def) # TO be saved
write.csv(
def,
paste0(
"./projects/errors/codelist/",
substr(names(old_files[h]), 1, 2),
"_",
snames[i],
"_",
codelistcols[k],
"_error_codematch_.csv"
), row.names = F, na=""
)
}
err.type<-paste0("Codelist Mismatch ", codelistcols[k]) #Error cal
err.count<-nrow(def) #Error cal
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
print(paste0("Removing errors ",codelistcols[k]))
# Removes any mismatch
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] <-
NA
# Matches each column with the corresponding code list and returns the value
df[, codelistcols[k]] <-
pull(codelist_files[codelistcols[k]][[1]], 2)[match(pull(df, codelistcols[k]),
pull(codelist_files[codelistcols[k]][[1]], Description))]
}
max.length <- as.numeric(sel.template.desc$`Max Length`)
dtype <- sel.template.desc$`Data Type`
rowval <- NULL
ival <- NULL
rval <- NULL
lenght.issue.df <- NULL
# Changing the data class
for (k in 1:ncol(df)) {
if (dtype[k] == "String") {
df[, k] <- as.character(pull(df, k))
}
if (dtype[k] == "Boolean") {
df[, k] <- as.logical(pull(df, k))
}
if (dtype[k] == "DateTime") {
df[, k] <- lubridate::ymd_hms(pull(df, k))
}
if (dtype[k] == "Time") {
df[, k] <- lubridate::hms(pull(df, k))
} # This list will increase and also change based on input date and time formats
}
print("Rectifying streetname")
# Street and House Number
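if (any(colnames(df) == "Street")) {
# Split "Street" into street name and house number; the result must be assigned back to df.
# Rows whose Street value contains no digit do not match the pattern and become NA in both parts.
df <- df |>
select(-any_of("House_Number")) |>
tidyr::extract("Street",
c("Streetname", "HouseNumber"),
"(\\D+)(\\d.*)") |>
rename(Street = Streetname, House_Number = HouseNumber) |>
select(all_of(sel.template.desc.colnames))
}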
# Length Rectification
colclasses <- lapply(df, class)
print("Rectifying Length")
for (k in seq_len(ncol(df))) {
# is.character() is safe for multi-class columns such as POSIXct; columns without a usable Max Length are skipped
if (is.character(df[[k]]) && !is.na(max.length[k])) {
print("found character column ")
rowval <- pull(df, 1)
ival <- ifelse(nchar(pull(df, k))== 0 | is.na(nchar(pull(df, k))),1,nchar(pull(df, k)))
rval <- max.length[k]
# rectifying data length
df[, k] <-
ifelse(nchar(pull(df, k)) > max.length[k],
substring(pull(df, k), 1, max.length[k]),
pull(df, k))
# Record and count length issues for this column only; outside the if-block the values of an earlier column would be counted again
lenght.issue.df <-
rbind(lenght.issue.df, data.frame(rowval, ival, rval))
err.type<- paste0("Length error ", colnames(df)[k]) # Error cal
err.count<- sum(ival>rval, na.rm = T) # Error cal
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
}
}
lenght.issue.df <- dplyr::filter(lenght.issue.df,ival>rval)
if(nrow(lenght.issue.df)>0){
write.csv(lenght.issue.df,
paste0(
"./projects/errors/length/",
substr(names(old_files[h]), 1, 2),
"_",
snames[i],
"_length_error.csv"
), row.names = F, na="")
}
assign(snames[i], df)
write.csv(df,paste0("./projects/output/", substr(names(old_files[h]), 1, 2), "_", snames[i],".csv"), row.names = F, na="")
if(nrow(error.df)>0){
write.csv(error.df, paste0("./projects/summary/",substr(names(old_files[h]), 1, 2), "_", snames[i],"_error",".csv"), row.names = F, na="") # Error write
}
err.summ<-rbind(err.summ,data.frame(Country=Country, Name=Name, Expected=Expected, Actual=nrow(df))) #Error Cal
}
write.csv(err.summ,
paste0("./projects/summary/" ,substr(names(old_files[h]), 1, 2), "_", snames[i],"_sumerror",".csv"), row.names = F, na="") # Error Write
}
end<-Sys.time()
end-strt
```
*The code failed because the Department column appears several times in the data and, while importing, R renamed the duplicates to Department..xx.*
*Manually verify that these are the required templates.*
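
The code list step in the chunk above flags values that are not in the code list and then replaces descriptions by their codes. In isolation, and with an invented code list, the idea can be sketched as follows. The column layout (code in the second column, label in `Description`) mirrors how the chunk accesses the code lists; the values and the name `status_codes` are hypothetical.

```r
# Hypothetical code list: label in "Description", code in column 2
status_codes <- data.frame(
  Description = c("Open", "Closed", "On hold"),
  Code = c("01", "02", "03")
)

legacy_status <- c("Open", "closed", NA, "On hold")

# Values that are neither in the code list nor missing are mismatches
mismatch_rows <- which(!legacy_status %in% c(status_codes$Description, NA))

# match() returns NA for anything it cannot find, so mismatches become NA
mapped <- status_codes[[2]][match(legacy_status, status_codes$Description)]

print(mismatch_rows)
print(mapped)
```

The comparison is case sensitive, so "closed" is reported as a mismatch even though "Closed" exists in the code list.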

398
Report.Rmd Normal file

@@ -0,0 +1,398 @@
---
title: "Report"
author: "Data Science Team, LaNubia"
date: "1/11/2022"
output:
html_document:
theme: lumen
highlight: tango
self_contained: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, error=TRUE, message=FALSE, warning=FALSE)
library(readxl)
library(DT)
library(tidyr)
library(dplyr)
rxl<- function(path,...){
tryCatch(read_excel(path,...), error= function(c){
c$message<-"No Data"
print("No Data")
stop(c)
})
}
ltodf<- function(path,...){
tryCatch(rbind.data.frame(path,...), error= function(c){
c$message<-"No Data"
print("No Data")
stop(c)
})
}
```
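The two wrappers in the setup chunk follow the same pattern: catch an error, replace its message with "No Data", and signal it again. With the chunk option `error=TRUE`, knitr then prints that message and carries on rendering. The sketch below reproduces the pattern with a hypothetical wrapper `with_no_data()` and a stand-in reader, so it runs without any Excel file.

```r
# Hypothetical wrapper in the style of rxl(): replace the error message, then re-signal
with_no_data <- function(expr) {
  tryCatch(expr, error = function(e) {
    e$message <- "No Data"
    stop(e)
  })
}

# A reader that always fails, standing in for read_excel() on a missing file
failing_reader <- function(path) stop("cannot open ", path)

msg <- tryCatch(
  with_no_data(failing_reader("./contacts/template.xlsx")),
  error = function(e) conditionMessage(e)
)

# Successful calls pass through unchanged
ok <- with_no_data(nrow(data.frame(a = 1:3)))

print(msg)
print(ok)
```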
## Status Report
### Input Available
```{r echo=FALSE, message=FALSE, warning=FALSE}
# list.files() expects a regular expression, so the dot is escaped and the end anchored
contactinputpath<-list.files("./contacts/raw-data", pattern="\\.xlsx$", full.names = T)
accountinputpath<-list.files("./accounts/raw-data", pattern="\\.xlsx?$", full.names = T)
projectinputpath<-list.files("./projects/raw-data", pattern="\\.xlsx?$", full.names = T)
supportinputpath<-list.files("./support/raw-data", pattern="\\.xlsx?$", full.names = T)
conta<-lapply(contactinputpath, read_excel)
names(conta)<-gsub("./contacts/raw-data/","",contactinputpath)
c<-lapply(conta, nrow)
Input_data<-"Contact"
#Country<-gsub(".xlsx","",names(conta))
Observations<-c
temp<-data.frame(Input_data,Observations) |>
pivot_longer(cols = (-1), names_to = "Country", values_to = "Observations") |>
mutate(Country=gsub("\\.xlsx$","",Country))
input.summary<-temp
acco<-lapply(accountinputpath, read_excel)
names(acco)<-gsub("./accounts/raw-data/","",accountinputpath)
a<-lapply(acco, nrow)
Input_data<-"Accounts"
#Country<-gsub(".xlsx","",names(conta))
Observations<-a
temp<-data.frame(Input_data,Observations) |>
pivot_longer(cols = (-1), names_to = "Country", values_to = "Observations") |>
mutate(Country=gsub("\\.xlsx?$","",Country)) # Strips .xls and .xlsx; the unescaped ".xls" left a trailing "x" for .xlsx files
input.summary<-rbind(input.summary,temp)
proja<-lapply(projectinputpath, read_excel)
names(proja)<-gsub("./projects/raw-data/","",projectinputpath)
p<-lapply(proja, nrow)
Input_data<-"Projects"
#Country<-gsub(".xlsx","",names(conta))
Observations<-p
temp<-data.frame(Input_data,Observations) |>
pivot_longer(cols = (-1), names_to = "Country", values_to = "Observations") |>
mutate(Country=gsub("\\.xlsx?$","",Country)) # Strips .xls and .xlsx; the unescaped ".xls" left a trailing "x" for .xlsx files
input.summary<-rbind(input.summary,temp)
suppo<-lapply(supportinputpath, read_excel)
names(suppo)<-gsub("./support/raw-data/","",supportinputpath)
s<-lapply(suppo, nrow)
Input_data<-"Support"
#Country<-gsub(".xlsx","",names(conta))
Observations<-s
temp<-data.frame(Input_data,Observations) |>
pivot_longer(cols = (-1), names_to = "Country", values_to = "Observations") |>
mutate(Country=gsub("\\.xlsx?$","",Country)) # Strips .xls and .xlsx; the unescaped ".xls" left a trailing "x" for .xlsx files
input.summary<-rbind(input.summary,temp)
datatable(input.summary, extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
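The chunk above repeats the same steps for each of the four segments. As a sketch only, those steps could be folded into one helper. `summarise_inputs()` is a hypothetical name, and the small data frames below stand in for the lists returned by `lapply(paths, read_excel)`.

```r
# Hypothetical helper: one row per input file for a given segment
summarise_inputs <- function(tables, segment) {
  data.frame(
    Input_data = segment,
    Country = sub("\\.xlsx?$", "", names(tables)),
    Observations = unname(vapply(tables, nrow, integer(1))),
    row.names = NULL
  )
}

# Stand-ins for the imported legacy files, named after their file names
contacts <- list("CH.xlsx" = data.frame(x = 1:3), "DE.xlsx" = data.frame(x = 1:5))
accounts <- list("CH.xls" = data.frame(x = 1:2))

input_summary <- rbind(
  summarise_inputs(contacts, "Contact"),
  summarise_inputs(accounts, "Accounts")
)
print(input_summary)
```

Building the long table directly avoids the round trip through `pivot_longer()` and keeps the country names independent of how `data.frame()` adjusts column names.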
Simplified view
```{r echo=FALSE}
input.summary |>
pivot_wider(names_from = Country, values_from = Observations) |> datatable(extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
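Every table in this report is rendered with the same `datatable()` options. A possible way to avoid repeating them is sketched below; `dt_options()` and `show_table()` are hypothetical helpers, and the option values are copied from the calls in this report.

```r
# Hypothetical helper returning the option list repeated in every datatable() call
dt_options <- function(page_length = 5) {
  list(
    paging = TRUE, scrollX = TRUE, searching = TRUE, ordering = TRUE,
    dom = "Bfrtip",
    buttons = c("copy", "csv", "excel", "pdf"),
    pageLength = page_length,
    lengthMenu = c(3, 5, 10)
  )
}

# Hypothetical wrapper; calling it requires the DT package loaded in the setup chunk
show_table <- function(x, ...) {
  DT::datatable(x, extensions = "Buttons", options = dt_options(...))
}

opts <- dt_options()
str(opts)
```

A chunk would then only need `show_table(input.summary)`, and a change to the buttons or page length would be made in one place.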
### Contacts
#### Template
SAP templates available:
```{r echo=FALSE}
datatable(data.frame(Templates=unique(rxl("./contacts/template.xlsx", sheet = "Field_Definitions")[,1])), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
#### Summary of Errors
```{r echo=FALSE, message=FALSE, warning=FALSE}
sumerrfilepath<-list.files("./contacts/summary", pattern="*sumerror.csv", full.names = T)
errfilepath<-list.files("./contacts/summary", pattern="*_error.csv", full.names = T)
sumerrfiles<-lapply(sumerrfilepath, read.csv)
datatable(do.call(ltodf, sumerrfiles), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
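The chunk above stacks all summary files of a segment into one table. The sketch below shows the same idea with an explicit guard for the case where no file exists yet. `combine_csv()` is a hypothetical helper, and the two files are written to a temporary directory with the column names used by the transformation scripts.

```r
# Hypothetical helper: read and stack CSV files, returning NULL when there are none
combine_csv <- function(paths) {
  if (length(paths) == 0) return(NULL)
  do.call(rbind, lapply(paths, read.csv))
}

# Two throwaway summary files
demo_dir <- file.path(tempdir(), "summary_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Country = "CH", Name = "Contact", Expected = 10, Actual = 8),
          file.path(demo_dir, "CH_Contact_sumerror.csv"), row.names = FALSE)
write.csv(data.frame(Country = "DE", Name = "Contact", Expected = 7, Actual = 7),
          file.path(demo_dir, "DE_Contact_sumerror.csv"), row.names = FALSE)

paths <- list.files(demo_dir, pattern = "_sumerror\\.csv$", full.names = TRUE)
combined <- combine_csv(paths)
empty <- combine_csv(character(0))
print(combined)
```

`rbind()` requires every file to have the same columns, which holds here because all summary files are written by the same code.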
#### Error by template
```{r echo=FALSE, message=FALSE, warning=FALSE}
errfiles<-lapply(errfilepath, read.csv)
datatable(do.call(ltodf, errfiles), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
### Accounts
#### Template
SAP templates available:
```{r echo=FALSE}
datatable(data.frame(Templates=unique(rxl("./accounts/template.xlsx", sheet = "Field_Definitions")[,1])), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
#### Summary of Errors
```{r echo=FALSE, message=FALSE, warning=FALSE}
sumerrfilepath<-list.files("./accounts/summary", pattern="*sumerror.csv", full.names = T)
errfilepath<-list.files("./accounts/summary", pattern="*_error.csv", full.names = T)
sumerrfiles<-lapply(sumerrfilepath, read.csv)
datatable(do.call(ltodf, sumerrfiles), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
#### Error by template
```{r echo=FALSE, message=FALSE, warning=FALSE}
errfiles<-lapply(errfilepath, read.csv)
datatable(do.call(ltodf, errfiles), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
### Projects
#### Template
SAP templates available:
```{r echo=FALSE}
datatable(data.frame(Templates=unique(rxl("./projects/template.xlsx", sheet = "Field_Definitions")[,1])), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
#### Summary of Errors
```{r echo=FALSE, message=FALSE, warning=FALSE}
sumerrfilepath<-list.files("./projects/summary", pattern="*sumerror.csv", full.names = T)
errfilepath<-list.files("./projects/summary", pattern="*_error.csv", full.names = T)
sumerrfiles<-lapply(sumerrfilepath, read.csv)
datatable(do.call(ltodf, sumerrfiles), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
#### Error by template
```{r echo=FALSE, message=FALSE, warning=FALSE}
errfiles<-lapply(errfilepath, read.csv)
datatable(do.call(ltodf, errfiles), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
### Support
#### Template
SAP templates available:
```{r echo=FALSE}
datatable(data.frame(Templates=unique(rxl("./support/template.xlsx", sheet = "Field_Definitions")[,1])), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
#### Summary of Errors
```{r echo=FALSE, message=FALSE, warning=FALSE}
sumerrfilepath<-list.files("./support/summary", pattern="*sumerror.csv", full.names = T)
errfilepath<-list.files("./support/summary", pattern="*_error.csv", full.names = T)
sumerrfiles<-lapply(sumerrfilepath, read.csv)
datatable(do.call(ltodf, sumerrfiles), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```
#### Error by template
```{r echo=FALSE, message=FALSE, warning=FALSE}
errfiles<-lapply(errfilepath, read.csv)
datatable(do.call(ltodf, errfiles), extensions = "Buttons",
options = list(paging = TRUE,
scrollX=TRUE,
searching = TRUE,
ordering = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf'),
pageLength=5,
lengthMenu=c(3,5,10) ))
```

313
Report.html Normal file

File diff suppressed because one or more lines are too long

Binary file not shown.

374
Support.Rmd Normal file

@@ -0,0 +1,374 @@
---
title: "Support"
author: "Scary Scarecrow"
date: "1/12/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(dplyr)
library(lubridate)
library(DT)
library(tidyr)
mutlstxlrdr<-function(){
for( i in seq_along(sheet.na)){
colnames<-unique(saptemplate[saptemplate$`Sheet Name`==snames[i],]$Header)
df<-read.table("", col.names = colnames)
assign(snames[i], df)
}
}
```
## Data transformation workflow
The following is the proposed preliminary workflow for the data transformation project.
>All files of a segment (support, accounts, etc.) should be inside the relevant folder. Each segment folder should contain one folder holding all codelist files. All legacy data (one file per country) should be inside the raw-data folder, with each file named after its country. A further file containing the field definitions, including the name of the matching column in the legacy file, should also be present.
>*Make sure that there are no hidden files inside the directory.*
### Code Lists
```{r Create List of Files, echo=TRUE, message=FALSE, warning=FALSE}
filenames <- list.files("./support/CodeList", pattern="\\.xlsx$", full.names = T) # list.files() takes a regular expression, not a glob. A separate code list directory could be avoided, but organizing the files may then be difficult. This can be explored further if we want to transform all the data in one go, i.e. not by function (support, accounts etc.).
# File paths
print(filenames)
```
Manually check that the list above includes all the codelist files.
If it does, read the files.
```{r codelistreader, echo=TRUE, message=FALSE, warning=FALSE}
sheet_names<-lapply(filenames, excel_sheets) # Creates a list of the sheet names
codelist_files<-NULL
for(i in seq_along(filenames)){
a<-lapply(excel_sheets(filenames[[i]]), read_excel, path = filenames[[i]], col_types = "text") # Reads the sheets of the excel files
names(a)<-c(sheet_names[[i]]) # Renames them according to the sheet names extracted above
codelist_files<-c(codelist_files,a)
}
# Names of the files imported
names(codelist_files)
#codelist_files<-unique(codelist_files)
codelist_files$Academic_Title
```
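The transformation chunk further down looks up each code list by the header name of the column, so every header that has a `CodeList File Path` needs an imported sheet of the same name. A quick consistency check might look like the sketch below. The template rows and sheet names are invented; only the column names `Header` and `CodeList File Path` come from this document.

```r
# Stand-ins: names of the imported code list sheets and a few template rows
codelist_names <- c("Academic_Title", "Country", "Language")
template <- data.frame(
  Header = c("Academic_Title", "Country", "Status", "Name"),
  `CodeList File Path` = c("titles.xlsx", "country.xlsx", "status.xlsx", NA),
  check.names = FALSE
)

# Headers that reference a code list
needs_codelist <- template$Header[!is.na(template$`CodeList File Path`)]
# Headers with no sheet of the same name among the imported code lists
missing_codelists <- setdiff(needs_codelist, codelist_names)

print(missing_codelists)
```

With the real objects, `codelist_names` would be `names(codelist_files)` and `template` would be the imported field definitions.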
### Templates
Let us now extract the data. Below we read the files in the raw-data folder, which hold all data related to `support` from the legacy system.
```{r readlegacyfilepath, echo=TRUE, message=FALSE, warning=FALSE}
oldfilepath<-list.files("./support/raw-data", pattern="\\.xlsx?$", full.names = T) # Change the path if needed; the pattern is a regular expression matching .xls and .xlsx
print(oldfilepath)
```
Manually check that the list matches the actual files.
```{r readlegacyfiles, echo=TRUE}
old_files<-NULL
#read_excel(path = oldfilepath[[i]], sheet = 1)
for(i in seq_along(oldfilepath)){
old_files[[i]]<-read_excel(path = oldfilepath[[i]], sheet = 1)
}
names(old_files)<-basename(oldfilepath) # File names without the directory part
```
*Some errors were noticed in the legacy file: columns with similar or identical names exist.*
```{r readSAPtemplate, echo=TRUE, message=FALSE, warning=FALSE}
saptemplate<-read_excel("./support/template.xlsx", sheet = "Field_Definitions")
# First few rows of the imported data
head(saptemplate)
```
*Please note that the format of the tables (sheet) has been slightly changed. Earlier the corresponding sheet name was mentioned in a row before the actual table. Now, all the rows mention the corresponding sheet name. This was done manually for convenience of data extraction*
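
One of the steps in the chunk below removes rows in which a mandatory field is empty. Reduced to its core, and with invented data, that step can be sketched in base R as follows. The column names and the `mandatory` vector are hypothetical.

```r
# Stand-in for one assembled template sheet
df <- data.frame(
  ID = c("1", "2", NA, "4"),
  Name = c("A", NA, "C", "D"),
  Comment = c(NA, "x", NA, NA)
)
mandatory <- c("ID", "Name") # hypothetical mandatory headers

# Count missing values per mandatory column before dropping anything
missing_counts <- vapply(mandatory, function(col) sum(is.na(df[[col]])), integer(1))

# Keep only rows where every mandatory column is present
complete <- df[stats::complete.cases(df[, mandatory, drop = FALSE]), ]

print(missing_counts)
print(complete)
```

Unlike the chunk below, which drops rows column by column, this sketch counts each column against the full table, so a row missing two mandatory values is counted once per column.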
```{r createmptySAPfiles, echo=TRUE, message=FALSE, warning=FALSE}
#orilo<-"en_US.UTF-8"
#Sys.setlocale(locale="en_US.UTF-8")
strt<-Sys.time()
snames <- unique(saptemplate$`Sheet Name`)
for (h in seq_along(old_files)) {
# Copy original data
old.copy <- old_files[[h]]
print(paste0(names(old_files[h])," imported"))
err.summ<-data.frame(Country=NULL, Name=NULL, Expected=NULL, Actual=NULL) #Error Cal
# Creates data frame for each sheet in snames
for (i in seq_along(snames)) {
print(paste0("Processing ..",snames[i]))
# Select the column names from the field description sheet
print("Creating template")
sel.template.desc <-
saptemplate[saptemplate$`Sheet Name` == snames[i], ]
print("Creating column names")
sel.template.desc.colnames <- sel.template.desc$Header
# Create a list by adding values from corresponding legacy data
temp <- NULL
print("adding values to template ")
for (j in seq_along(sel.template.desc.colnames)) {
temp[j] <-ifelse(sel.template.desc$oldkey[j]=="NA" | is.na(sel.template.desc$oldkey[j]),
NA,as.vector(old.copy[, sel.template.desc$oldkey[j]])
)
}
# Rename the columns according to field description
print("renaming template ")
names(temp) <- sel.template.desc.colnames
# Create data frame from the list
df <- as.data.frame(temp)
print("Converted to data frame")
# Error summary file
Expected<-nrow(df)
# Select essential (mandatory) columns
print("Identifying essential columns")
sel.template.desc |>
filter(Mandatory == "Yes") |>
pull(Header) -> essential.columns
error.mandatory <- NULL
error.df<-data.frame(Country=NULL, Name=NULL, Rows=NULL, Expected=NULL)
# Operate on essential columns including creation of error file
for (k in seq_along(essential.columns)) {
if(essential.columns[k]=="International_Version"){
print("Found International Version. Adding 0.")
#stop()
df$International_Version<-"0"
}
print("Creating and writing data with missing mandatory values")
assign(
paste0(
"error_mandatory_",
substr(names(old_files[h]), 1, 2),
"_",
snames[i],
"_",
essential.columns[k]
),
df[is.na(df[, essential.columns[k]]), ]
)
# TO be saved in error files
if(nrow(df[is.na(df[, essential.columns[k]]), ])>0){
write.csv(
df[is.na(df[, essential.columns[k]]), ],
paste0(
"./support/errors/mandatory/",
substr(names(old_files[h]), 1, 2),
"_",
snames[i],
"_",
essential.columns[k],
"_error_mandatory.csv"
), row.names = F, na=""
)
}
# Error summary file
Country<-substr(names(old_files[h]), 1, 2)
Name<-snames[i]
err.type<-paste0("Missing ",essential.columns[k])
err.count<-nrow(df[is.na(df[, essential.columns[k]]), ])
print("Removing rows with empty essential columns")
df <- df[!is.na(df[, essential.columns[k]]), ]
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
}
print("Identifying columns associated with codelists")
# List of columns that have a codelist
codelistcols <- sel.template.desc |>
filter(!is.na(`CodeList File Path`)) |> pull(Header)
for (k in seq_along(codelistcols)) {
if(codelistcols[k]=="International_Version"){
print("Found International Version. Adding 0.")
df$International_Version<-"0"
}
print(paste0("Identifying errors ",codelistcols[k]))
def.rows <-
which(!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA))
def.n<- df[def.rows, 1]
def.rows.val <-
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]]
def <- data.frame(def.rows, def.n,def.rows.val)
if(nrow(def)>0){ # Compare the row count, not the data frame, with 0
assign(paste0(
"error_codematch_",
substr(names(old_files[h]), 1, 2), # Country prefix of the file being processed
"_",
snames[i],
"_",
codelistcols[k]
),
def) # TO be saved
write.csv(
def,
paste0(
"./support/errors/codelist/",
substr(names(old_files[h]), 1, 2),
"_",
snames[i],
"_",
codelistcols[k],
"_error_codematch_.csv"
), row.names = F, na=""
)
}
err.type<-paste0("Codelist Mismatch ", codelistcols[k]) #Error cal
err.count<-nrow(def) #Error cal
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
print(paste0("Removing errors ",codelistcols[k]))
# Removes any mismatch
df[!df[, codelistcols[k]] %in% c(pull(codelist_files[codelistcols[k]][[1]], Description), NA), codelistcols[k]] <-
NA
# Matches each column with the corresponding code list and returns the value
df[, codelistcols[k]] <-
pull(codelist_files[codelistcols[k]][[1]], 2)[match(pull(df, codelistcols[k]),
pull(codelist_files[codelistcols[k]][[1]], Description))]
}
max.length <- as.numeric(sel.template.desc$`Max Length`)
dtype <- sel.template.desc$`Data Type`
rowval <- NULL
ival <- NULL
rval <- NULL
lenght.issue.df <- NULL
# Changing the data class
for (k in 1:ncol(df)) {
if (dtype[k] == "String") {
df[, k] <- as.character(pull(df, k))
}
if (dtype[k] == "Boolean") {
df[, k] <- as.logical(pull(df, k))
}
if (dtype[k] == "DateTime") {
df[, k] <- lubridate::ymd_hms(pull(df, k))
}
if (dtype[k] == "Time") {
df[, k] <- lubridate::hms(pull(df, k))
} # This list will increase and also change based on input date and time formats
}
print("Rectifying streetname")
# Street and House Number
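if (any(colnames(df) == "Street")) {
# Split "Street" into street name and house number; the result must be assigned back to df.
# Rows whose Street value contains no digit do not match the pattern and become NA in both parts.
df <- df |>
select(-any_of("House_Number")) |>
tidyr::extract("Street",
c("Streetname", "HouseNumber"),
"(\\D+)(\\d.*)") |>
rename(Street = Streetname, House_Number = HouseNumber) |>
select(all_of(sel.template.desc.colnames))
}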
# Length Rectification
colclasses <- lapply(df, class)
print("Rectifying Length")
for (k in seq_len(ncol(df))) {
# is.character() is safe for multi-class columns such as POSIXct; columns without a usable Max Length are skipped
if (is.character(df[[k]]) && !is.na(max.length[k])) {
print("found character column ")
rowval <- pull(df, 1)
ival <- ifelse(nchar(pull(df, k))== 0 | is.na(nchar(pull(df, k))),1,nchar(pull(df, k)))
rval <- max.length[k]
# rectifying data length
df[, k] <-
ifelse(nchar(pull(df, k)) > max.length[k],
substring(pull(df, k), 1, max.length[k]),
pull(df, k))
# Record and count length issues for this column only; outside the if-block the values of an earlier column would be counted again
lenght.issue.df <-
rbind(lenght.issue.df, data.frame(rowval, ival, rval))
err.type<- paste0("Length error ", colnames(df)[k]) # Error cal
err.count<- sum(ival>rval, na.rm = T) # Error cal
if(err.count>0){
error.df<-rbind(error.df,data.frame(Country=Country, Name=Name, err.type=err.type, err.count=err.count)) #Error cal
}
}
}
lenght.issue.df <- dplyr::filter(lenght.issue.df,ival>rval)
if(nrow(lenght.issue.df)>0){
write.csv(lenght.issue.df,
paste0(
"./support/errors/length/",
substr(names(old_files[h]), 1, 2),
"_",
snames[i],
"_length_error.csv"
), row.names = F, na="")
}
assign(snames[i], df)
write.csv(df,paste0("./support/output/", substr(names(old_files[h]), 1, 2), "_", snames[i],".csv"), row.names = F, na="")
if(nrow(error.df)>0){
write.csv(error.df, paste0("./support/summary/",substr(names(old_files[h]), 1, 2), "_", snames[i],"_error",".csv"), row.names = F, na="") # Error write
}
err.summ<-rbind(err.summ,data.frame(Country=Country, Name=Name, Expected=Expected, Actual=nrow(df))) #Error Cal
}
write.csv(err.summ,
paste0("./support/summary/" ,substr(names(old_files[h]), 1, 2), "_", snames[i],"_sumerror",".csv"), row.names = F, na="") # Error Write
}
end<-Sys.time()
end-strt
```
*The code failed because the Department column appears several times in the data and, while importing, R renamed the duplicates to Department..xx.*
*Manually verify that these are the required templates.*
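
The street and house number step in the chunk above relies on the pattern `(\D+)(\d.*)`. The base R sketch below applies the same pattern to a few invented addresses. It differs from the chunk in one deliberate, assumed design choice: a value without any digit keeps its text as the street name instead of becoming missing.

```r
street <- c("Bahnhofstrasse 12a", "Rue de la Gare 7", "Postfach", NA)

# Non-digits first, then a digit followed by the rest of the string
pattern <- "^(\\D+)(\\d.*)$"
has_number <- grepl(pattern, street)

# Keep the original text when no house number can be split off
street_name <- ifelse(has_number, trimws(sub(pattern, "\\1", street)), street)
house_number <- ifelse(has_number, sub(pattern, "\\2", street), NA_character_)

print(data.frame(street, street_name, house_number))
```

Addresses that start with the number, such as "12 Main Street", do not match this pattern and would need a separate rule.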

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long


@@ -0,0 +1 @@
<?xml version="1.0" encoding="utf-8"?><Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"><Default Extension="xml" ContentType="application/octet-stream" /></Types>


@@ -0,0 +1 @@
<?xml version="1.0" encoding="utf-8"?><Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"><Default Extension="xml" ContentType="application/octet-stream" /></Types>

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Some files were not shown because too many files have changed in this diff Show More