Screening Input Data Sources

Summary reports for input data sources used by the statistical modules

Observation Count

UNSD Tariffline

  • Using all chapters; chapter column added to tables before 2009 only recently

Partner Dimension

  • x-axis: time period
  • y-axis: number of observations by partner

Database query output

## 'data.frame':    3819 obs. of  3 variables:
##  $ year: int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ prt : Factor w/ 263 levels "4","8","10","12",..: 3 31 32 33 34 35 4 36 37 38 ...
##  $ cnt : int  366 122448 15465 2773 47081 34981 35575 23670 244169 8205 ...

Summary statistics

##       year           prt            cnt         
##  Min.   :2000   4      :  15   Min.   :      1  
##  1st Qu.:2003   8      :  15   1st Qu.:   4957  
##  Median :2007   10     :  15   Median :  17560  
##  Mean   :2007   12     :  15   Mean   :  65528  
##  3rd Qu.:2011   16     :  15   3rd Qu.:  62636  
##  Max.   :2014   20     :  15   Max.   :1897129  
##                 (Other):3729

Values larger than 10^{6}

##    year prt     cnt
## 1  2014 156 1104898
## 2  2013 156 1117188
## 3  2013 276 1034622
## 4  2012 156 1831846
## 5  2012 840 1897129
## 6  2011 156 1333591
## 7  2011 840 1714170
## 8  2010 156 1462158
## 9  2010 840 1573837
## 10 2009 156 1066428
## 11 2009 840 1774167
## 12 2000 276 1169905
## 13 2000  56 1022478
## 14 2000 840 1509820

Counting the number of observations by year and prt between 2000 and 2014.

  • the SWS source datatable id is ct_tariffline_unlogged_[year]
  • the result has been obtained using the following query:
spark.sql("SELECT year, prt, COUNT(*) AS cnt FROM parquetTable GROUP BY year, prt ORDER BY year, prt")
  • count_[prt] is the rowcount with the respective prt, e.g. count_HS2 are the number of observations using commodity classification H2 (HS 2002).
  • pct_max represents the share of the HS group that has the largest share in the respective year (in percent)

Dimension Member Change

Eurostat CN8

Count of reporters per period

## # A tibble: 15 × 2
##    period count
##     <chr> <int>
## 1  200052    28
## 2  200152    28
## 3  200252    29
## 4  200352    29
## 5  200452    29
## 6  200552    29
## 7  200652    29
## 8  200752    29
## 9  200852    29
## 10 200952    29
## 11 201052    29
## 12 201152    29
## 13 201252    29
## 14 201352    29
## 15 201452    29

Unique reporters with ISO3 codes

##    codeUN cou
## 1      40 AUT
## 2      56 BEL
## 3     100 BGR
## 4     196 CYP
## 5     203 CZE
## 6     276 DEU
## 7     208 DNK
## 8     724 ESP
## 9     233 EST
## 10    246 FIN
## 11    250 FRA
## 12    826 GBR
## 13    300 GRC
## 14    191 HRV
## 15    348 HUN
## 16    372 IRL
## 17    380 ITA
## 18    440 LTU
## 19    442 LUX
## 20    428 LVA
## 21    470 MLT
## 22    528 NLD
## 23    616 POL
## 24    620 PRT
## 25    642 ROU
## 26    703 SVK
## 27    705 SVN
## 28    752 SWE

Count of partners per period

## # A tibble: 15 × 2
##    period count
##     <chr> <int>
## 1  200052   232
## 2  200152   243
## 3  200252   244
## 4  200352   244
## 5  200452   244
## 6  200552   245
## 7  200652   244
## 8  200752   244
## 9  200852   243
## 10 200952   243
## 11 201052   244
## 12 201152   243
## 13 201252   243
## 14 201352   249
## 15 201452   249

Unique partners with ISO3 codes

## Warning in FUN(.): These ES Geonom codes were not converted to FAO area
## list: 479, 475, 466, 477, 977, 229, 955, 225
##   [1] "533 ABW" "4 AFG"   "24 AGO"  "8 ALB"   "784 ARE" "32 ARG"  "51 ARM" 
##   [8] "16 ASM"  "28 ATG"  "36 AUS"  "36 AUS"  "36 AUS"  "36 AUS"  "36 AUS" 
##  [15] "36 AUS"  "40 AUT"  "31 AZE"  "108 BDI" "56 BEL"  "204 BEN" "854 BFA"
##  [22] "50 BGD"  "100 BGR" "48 BHR"  "44 BHS"  "70 BIH"  "112 BLR" "84 BLZ" 
##  [29] "60 BMU"  "68 BOL"  "76 BRA"  "52 BRB"  "96 BRN"  "64 BTN"  "72 BWA" 
##  [36] "140 CAF" "124 CAN" "756 CHE" "756 CHE" "152 CHL" "384 CIV" "120 CMR"
##  [43] "180 COD" "178 COG" "184 COK" "170 COL" "174 COM" "132 CPV" "188 CRI"
##  [50] "192 CUB" "136 CYM" "196 CYP" "203 CZE" "276 DEU" "262 DJI" "212 DMA"
##  [57] "208 DNK" "214 DOM" "12 DZA"  "218 ECU" "818 EGY" "232 ERI" "724 ESP"
##  [64] "724 ESP" "724 ESP" "233 EST" "231 ETH" "231 ETH" "246 FIN" "242 FJI"
##  [71] "238 FLK" "238 FLK" "250 FRA" "250 FRA" "234 FRO" "583 FSM" "266 GAB"
##  [78] "826 GBR" "268 GEO" "288 GHA" "324 GIN" "270 GMB" "624 GNB" "226 GNQ"
##  [85] "300 GRC" "308 GRD" "304 GRL" "320 GTM" "316 GUM" "316 GUM" "328 GUY"
##  [92] "344 HKG" "340 HND" "191 HRV" "332 HTI" "348 HUN" "360 IDN" "356 IND"
##  [99] "372 IRL" "364 IRN" "368 IRQ" "352 ISL" "376 ISR" "380 ITA" "380 ITA"
## [106] "380 ITA" "388 JAM" "400 JOR" "392 JPN" "398 KAZ" "404 KEN" "417 KGZ"
## [113] "116 KHM" "296 KIR" "659 KNA" "659 KNA" "410 KOR" "414 KWT" "418 LAO"
## [120] "422 LBN" "430 LBR" "434 LBY" "662 LCA" "144 LKA" "426 LSO" "440 LTU"
## [127] "442 LUX" "428 LVA" "446 MAC" "504 MAR" "498 MDA" "450 MDG" "462 MDV"
## [134] "484 MEX" "584 MHL" "807 MKD" "466 MLI" "470 MLT" "104 MMR" "499 MNE"
## [141] "496 MNG" "508 MOZ" "478 MRT" "500 MSR" "480 MUS" "454 MWI" "458 MYS"
## [148] "175 MYT" "516 NAM" "540 NCL" "562 NER" "566 NGA" "558 NIC" "570 NIU"
## [155] "528 NLD" "578 NOR" "524 NPL" "520 NRU" "554 NZL" "554 NZL" "554 NZL"
## [162] "512 OMN" "586 PAK" "591 PAN" "604 PER" "608 PHL" "585 PLW" "598 PNG"
## [169] "616 POL" "408 PRK" "620 PRT" "600 PRY" "275 PSE" "258 PYF" "634 QAT"
## [176] "642 ROU" "643 RUS" "646 RWA" "682 SAU" "686 SEN" "702 SGP" "654 SHN"
## [183] "90 SLB"  "694 SLE" "222 SLV" "706 SOM" "666 SPM" "688 SRB" "688 SRB"
## [190] "688 SRB" "678 STP" "740 SUR" "703 SVK" "705 SVN" "752 SWE" "748 SWZ"
## [197] "690 SYC" "760 SYR" "796 TCA" "148 TCD" "768 TGO" "764 THA" "762 TJK"
## [204] "795 TKM" "776 TON" "780 TTO" "788 TUN" "792 TUR" "798 TUV" "834 TZA"
## [211] "800 UGA" "804 UKR" "858 URY" "840 USA" "840 USA" "840 USA" "860 UZB"
## [218] "670 VCT" "862 VEN" "92 VGB"  "704 VNM" "548 VUT" "876 WLF" "882 WSM"
## [225] "887 YEM" "710 ZAF" "894 ZMB" "716 ZWE" "530 NA"  "1248 NA" "158 NA" 
## [232] "729 NA"

Workflow

The workflow diagram informs about the various steps of the pre-processing

SWS Data

  1. use R faosws to retrieve datatables for each year from SWS (approx. 15-25 minutes per table, total around 30 tables)
  2. save to R native format rds to preserve data types
  3. export to csv files and compress using gzip

Processing (Spark)

  1. load data for a range of years from both data sources (Eurostat CN8 and UNSD Tariffline) and store in parquet format, partitioned by year
  2. create SparkSQL view from partitioned parquet folders and execute SparkSQL query
  3. export query result to csv file

dryworkflow (R)

  1. set up routine that detects changes in result created by Spark routine using dryworkflow / Makefiles
  2. load SparkSQL query result csv data, perform summary statistics and create data plots
  3. use html_fragment as export type in rmarkdown; this will create an HTML page without a header that can be used in Jekyll

Website (Jekyll)

  1. use import batch script to check for modified files in dryworkflow folders
  2. make a minor modification to the markdown file where the html_fragment is included to trigger jekyll update
  3. if jekyll serve is running, the contents of _site directory will be updated automatically
  4. export modified contents to sws_r_share/fbsmodules using rsync

Diagram re-processing workflow

Resources

The various source files for Jekyll and dryworkflow have been copied to a network location

Table of Contents